Big Data Archives - Page 8 of 11 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Python

3 Tips for Unit Testing PySpark Pipelines

I’m not sure what it is, but some prevailing evil in the Data Engineering world has made it uncommon for PySpark pipelines to be unit tested. Who knows, it’s probably a combination of things. Data Engineers have been accused of not having good Software Engineering principles. Functional testing is a hot commodity in the Software Engineering world but probably takes a while to trickle its way into mainstream Data Engineering. It can require good Docker skills. Also, generally speaking, the old-school Data and ETL Developers who preceded Data Engineers in bygone days never unit tested … so neither do their descendants.

Who knows? All that being said I want to give you 3 tips to help you unit test your PySpark ETL data pipelines.

November 9, 2021

Big Data, Data, Data Engineering, Data Warehousing

6 Tips for Optimizing Databricks Cluster and Pipeline Performance and Cost

Databricks, easily the hottest tool these days for Data Lakes and Data Warehousing, is a beast. As with any new technology there are always growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understanding certain concepts, and being unaware of specific configurations, can cost you time and money very easily when running large ETL pipelines on Databricks.

I want to share 6 tips for Databricks newbies, and oldies, that are foundational to good Data Engineering architecture, affecting both performance and cost.

October 29, 2021

Big Data, Data, Data Engineering, Data Warehousing

Data Modeling – Relational Databases (SQL) vs Data Lake (File Based)

Data Modeling is a topic that never goes away. Sometimes I do reminisce about the good ol’ days of Kimball-style data models; it was so simple and straightforward, just the same thing for years. Then Big Data happened, Spark happened. Things just changed. There is a lot of new content coming out around Data Lakes and data modeling, but it still seems like a fluid topic, with nothing as concrete as the classic Data Warehouse Toolkit.

Oh, what to do, what to do. I do believe there are a few key ideas and points to being successful with file-based Data Lake modeling. I think it’s a mistake to fully embrace the classic Kimball-style Data Warehouse approach. It really comes down to this: Relational Database (SQL) and File-Based data models are going to be different, for both technical and practical reasons.

October 20, 2021

Big Data, Data, Data Engineering, Python, Ramblings

Review of Airbyte for Data Engineers

It’s hard to keep up with the never-ending stream of new Data Engineering tools these days. Always something new around the next bend. I find it interesting to kick the tires on the new kids on the block. It’s always interesting to see what angle or pain point a new tool tries to home in on. I mean, if you think about Data Engineering in general, the fundamentals really haven’t changed that much over the years. The tools change, but what we do hasn’t. We are expected to move data from point A to point B in a reliable, scalable, and efficient manner.

Today I’m going to be reviewing a tool called Airbyte. When I review a new product I’m usually incredibly basic about what I look for and I try to answer some easy and obvious questions. How easy is it to set up and use? What does the documentation look like? When I run into a problem can I solve it? Is the overhead of adding this new tool to a tech stack worth the features it offers? This is how we will explore Airbyte.

October 13, 2021

Big Data, Data, Data Engineering, Python

Bitwise Operations for Data Engineers

Ugh. Cursed bitwise operations … something usually reserved for the low-level mythical engineers writing code no one should have to write. I’ve escaped all but twice during my meager existence. Recently I had to use a bitwise operation while converting a Python hashing algorithm into PySpark code. It made my brain hurt. What is this wizardry all about anyways? It got me thinking, I should really attempt to learn something about bitwise operations since it comes up once every 10 years.
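To make the wizardry a little more concrete, here is a minimal sketch of where bitwise operations show up in a hashing routine. The excerpt doesn't say which hashing algorithm was being converted, so the 32-bit FNV-1a hash below is only a stand-in, and the `bucket` helper is a hypothetical add-on to show a shift. It uses XOR (`^`) to mix in each byte, AND (`&`) to truncate Python's unbounded integers to 32 bits, and a right shift (`>>`) to pull out the top bits.

```python
# Illustrative only: FNV-1a is a stand-in, not the algorithm from the post.
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime
MASK_32 = 0xFFFFFFFF              # all ones in the low 32 bits


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of `data`."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte                         # XOR flips bits in the low 8 bits of h
        h = (h * FNV_PRIME_32) & MASK_32  # AND keeps only the low 32 bits
    return h


def bucket(h: int, bits: int = 4) -> int:
    """Hypothetical helper: take the top `bits` bits of a 32-bit hash via a right shift."""
    return h >> (32 - bits)


if __name__ == "__main__":
    h = fnv1a_32(b"a")
    print(hex(h))     # 0xe40c292c
    print(bucket(h))  # 14
```

The masking step is the part that tends to bite when porting to PySpark: Python integers never overflow, while Spark's integer and long columns are fixed-width and signed, so the truncation has to be made explicit on the Spark side too.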

October 13, 2021

Big Data, Data, Data Engineering, Python, Ramblings

The Wild West of Parallel Computing – Review of Bodo.ai

It truly is the Wild West of parallel computing these days. It seems that big data has brought out an onslaught of companies trying either to make it easier to use any number of big data platforms or to build their own. Most of them take shots at tools like Spark and Dask, probably two of the more well-known big data engines. Of course, with Python’s rise, especially in Data Science and ML, many of these tools target that audience.

One such newcomer is Bodo.ai, and I’ve seen them pop up on places like r/dataengineering. Fortunately, they have a free community edition, so let’s kick the tires and see what’s going on.

September 24, 2021

Big Data, Data, Data Engineering, Machine Learning, Python

Dask vs PySpark – Performance and Other Thoughts.

Every once in a while I see someone talking about their wonderful distributed cluster of Dask machines, and my curiosity gets aroused. I know plenty of people use Dask, mostly on their local machines, but it seems like, with the meteoric rise of Spark, especially with tools like EMR and Databricks, Dask is slowly slipping into the shadows. I’ve had bad experiences with Dask in the past, trying to get it to work well in production. I suppose that comes from working with tried and true Spark and other bulletproof distributed systems. I’ve been meaning to return to Dask for a while, to compare similar Dask and Spark clusters on performance … and other things like ease of setup and writing code. Let’s get to it.

September 6, 2021

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

Databricks vs AWS EMR – Theory and Real Life.

I saw a recent post on r/dataengineering, a question centered around why Databricks is so popular when tools like EMR have been floating around for so long. It got me thinking about it. It really isn’t all about the technical side and offerings, although that does play a large role. There are always proponents for every technology, old or new … like with our favorite band or sports team, we fight to the death for what we love and cherish. I want to talk theoretically and technically about Databricks and EMR, and why you should use Databricks. 🙂

August 19, 2021

Big Data, Data, Data Engineering, Python

“Don’t mess with the dials,” they said. Spark (PySpark) Shuffle Partition Configuration and Performance.

Sometimes I amaze myself. I’ve been using PySpark for a few years now, happily crunching hundreds of TBs of data without much problem. Sure, you randomly run into OOM errors and other such nonsense. Usually inspecting the code for something silly, or throwing in a persist() or cache() here and there, will solve 99% of the problems. I’ve always approached Spark performance with an overly pragmatic approach. Spark being the beast that it is, it’s easy to hide performance problems with more resources, etc. I’ve generally tried to stay away from UDFs, just using good coding practices and out-of-the-box functionality. Ensuring good predicate pushdowns, data partitioning, etc. are all helpful and important. But in the end … I don’t really know much about the out-of-the-box Spark configurations and how they affect performance.

Do the configurations change based on data size and partitioning strategy plus resources and cluster size? Probably. Does that seem complicated to figure out? Yes. Is the internet full of conflicting, vague and confusing advice? Of course.
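As a rough illustration of how data size and cluster size can feed into one of those dials, here is a minimal sketch of a sizing heuristic for `spark.sql.shuffle.partitions` (which defaults to 200 in Spark). Everything in it is an assumption rather than something taken from the post: the helper name `suggest_shuffle_partitions` is hypothetical, the 128 MiB target partition size is just one commonly quoted starting point, and rounding up to a multiple of the core count is only one of several rules of thumb floating around.

```python
SPARK_DEFAULT_SHUFFLE_PARTITIONS = 200  # Spark's default for spark.sql.shuffle.partitions


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    total_cores: int,
    target_partition_bytes: int = 128 * 1024 * 1024,  # assumed target, not from the post
) -> int:
    """Hypothetical helper: aim for partitions near the target size,
    rounded up to a multiple of the available cores."""
    if shuffle_bytes <= 0 or total_cores <= 0 or target_partition_bytes <= 0:
        raise ValueError("all inputs must be positive")
    # Ceiling division: number of partitions needed at the target size.
    by_size = (shuffle_bytes + target_partition_bytes - 1) // target_partition_bytes
    # Round up to whole "waves" of tasks so no core sits idle in the last wave.
    waves = (by_size + total_cores - 1) // total_cores
    return waves * total_cores


if __name__ == "__main__":
    n = suggest_shuffle_partitions(shuffle_bytes=50 * 1024**3, total_cores=32)
    print(n)  # 416
    # With a live SparkSession named `spark`, the value could be applied with:
    # spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

Treat the output as a first guess to benchmark against the default, not as a recommendation; the right value depends on the workload.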

August 13, 2021

Big Data, Data, Data Engineering, Ramblings

Build your Data Engineering skills with Open Source Data

There is many a day when I find myself scrolling through the r/dataengineering subreddit; it’s a fun place to stalk. Lots of people with lots of opinions make for interesting times. I see one question, or a variation of it, come up over and over again. How do I learn data engineering skills, how do I get into data engineering, what kind of problems do data engineers solve, blah, blah, blah? It’s a great question, and one without an easy answer. Well … there is an answer, but it takes some time and willpower to get it done. Open source data. This is the way. Read books, take classes, do whatever, but it’s hard to really learn the skills needed day-to-day as a data engineer without actually doing the work. But how do you do the work without the work? Make up your own work, I say.

August 1, 2021
