The testing never ends. Tests tests tests, and more tests. When it comes to data engineering and data pipelines it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines.

In defense of many data engineers, because of the varied backgrounds people come from, some were never taught, or never realized, the importance of good software design and testing practices. Sure, it always “takes more time” upfront to design data pipelines with code that is functional and unit-testable, and even more to make them integration testable from end to end. It requires some foresight and thought in both data architecture and pipeline design to enable complete testability.

Integration testing end-to-end in an automated manner is a tough nut to crack. How can you do such a thing on massive pipelines that crunch hundreds of TBs of data? With a little creativity.

Read more

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I worked on a lot during the beginning of my data career. Relational databases plus Python was pretty much the name of the game.

One of the struggles I always had was figuring out how fast I could load data into Postgres. psycopg2 was always my Python package of choice for working with Postgres; it’s a wonderful library. Today I want to give a shout-out to my old self by performance testing Python inserts into Postgres. There are about a million ways, sizes, and shapes of getting a bunch of records from some CSV file, through Python, and into Postgres.
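As a rough illustration of one of those ways, here is a minimal sketch of reading a CSV and grouping its rows into batches, which is usually the first step before any bulk insert. The function name `iter_batches`, the sample data, and the table name in the comment are all hypothetical, and the actual database call is left as a comment so the snippet runs without a Postgres instance.

```python
import csv
import io
from itertools import islice


def iter_batches(csv_file, batch_size=1000):
    """Yield lists of row tuples from a CSV file object, skipping the header."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    while True:
        batch = [tuple(row) for row in islice(reader, batch_size)]
        if not batch:
            return
        yield batch


# In-memory stand-in for a real CSV file (hypothetical data).
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
batches = list(iter_batches(sample, batch_size=2))

# With psycopg2, each batch could then be handed to something like
# psycopg2.extras.execute_values(
#     cur, "INSERT INTO my_table (id, name) VALUES %s", batch
# )
# where `cur` is an open cursor and `my_table` is a hypothetical table.
```

Batch size is the knob worth experimenting with in a performance test: too small and you pay per-statement overhead, too large and you hold a lot of rows in memory at once.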

I also enjoy making people mad … there’s always that. Nothing makes people mad at you like a good ol’ performance test 🙂

Read more

Seriously, just don’t do it, they are bad for you. Listen to your mother, just say no. The dreaded ORMs (Object-Relational Mappers) do all the hard SQL work for you, but they come with many unintended consequences that are bad for your health and wellness in the long term. Many unsuspecting victims have been sucked into ORMs by the promise of an easier transition, one that gives programmers a familiar object-oriented design pattern for manipulating the data in a relational database, say Postgres or MySQL.

Again I tell you, don’t fall for the siren songs, there are tears and sorrow down the long and lonely ORM road.

Read more

I’m not sure what it is, but some prevailing evil in the Data Engineering world has made it not so common for PySpark pipelines to be unit tested. Who knows, it’s probably a combination of things. Data Engineers have been accused of not having good Software Engineering principles. Functional testing is a hot commodity in the Software Engineering world but will probably take a while to trickle its way into mainstream Data Engineering. It can require good Docker skills. Also, generally speaking, the old school Data and ETL Developers that preceded Data Engineers in the bygone days never unit tested … so neither do their descendants.

Who knows? All that being said I want to give you 3 tips to help you unit test your PySpark ETL data pipelines.

Read more

It’s hard to keep up with the never-ending stream of new Data Engineering tools these days. There’s always something new around the next bend. I find it interesting to kick the tires on the new kids on the block. It’s always interesting to see what angle or pain point a new tool tries to home in on. I mean, if you think about Data Engineering in general, the fundamentals really haven’t changed that much over the years. The tools change, but what we do hasn’t. We are expected to move data from point A to point B in a reliable, scalable, and efficient manner.

Today I’m going to be reviewing a tool called Airbyte. When I review a new product I’m usually incredibly basic about what I look for and I try to answer some easy and obvious questions. How easy is it to set up and use? What does the documentation look like? When I run into a problem can I solve it? Is the overhead of adding this new tool to a tech stack worth what features it offers? This is how we will explore Airbyte.

Read more

Ugh. Cursed bitwise operations … something usually reserved for the low-level mythical engineers writing code no one should have to write. I’ve escaped them all but twice during my meager existence. Recently I had to use a bitwise operation while converting a Python hashing algorithm into PySpark code. It made my brain hurt. What is this wizardry all about anyway? It got me thinking, I should really attempt to learn something about bitwise operations since they come up once every 10 years.
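To make the wizardry a little more concrete, here is a small sketch of the basic bitwise operators in plain Python, followed by a 32-bit FNV-1a hash. FNV-1a is only a stand-in chosen for illustration; it is not necessarily the hashing algorithm from the conversion mentioned above. The interesting part is the `& 0xFFFFFFFF` mask: Python integers never overflow, so the mask is what keeps the value inside 32 bits the way a fixed-width integer would.

```python
# Basic bitwise operators on small binary literals.
and_result = 0b1100 & 0b1010   # bits set in both operands
or_result = 0b1100 | 0b1010    # bits set in either operand
xor_result = 0b1100 ^ 0b1010   # bits set in exactly one operand
shifted = 1 << 3               # shift left by 3 bits (multiply by 8)

FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193
MASK_32 = 0xFFFFFFFF


def fnv1a_32(data: bytes) -> int:
    """Compute the 32-bit FNV-1a hash of a bytes object."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte                          # mix the byte in with XOR
        h = (h * FNV_PRIME_32) & MASK_32   # multiply, then truncate to 32 bits
    return h


print(hex(fnv1a_32(b"a")))  # 0xe40c292c
```

When porting something like this to another engine, the mask is the detail that tends to bite, because integer widths and overflow behavior differ between Python and fixed-width column types.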

Read more

If you’ve been around Data Engineering for a while, like me, you’ve noticed a few trends in the industry at large, and in individual data engineers themselves. There seem to be a few types of data engineers, and which one you are depends on where you’ve worked and what your projects have looked like. Some data engineers focus on general ETL, Data Warehousing, and such things. They move data around and transform it using a myriad of tools. The other set of data engineers is more focused on infrastructure at a low level; they provide the underlying tools and services others use to move that data around and transform it.

Which are you? One of those topics you may or may not be familiar with depending on your background is RPC or more specifically gRPC. What is it?

Read more

It truly is the Wild West of parallel computing these days. It seems that big data has brought out an onslaught of companies trying to either take advantage of making it easier to use any number of big data platforms or making up their own. Most of them usually take shots at tools like Spark and Dask, probably two of the more well-known big data engines. Of course with Python’s rise, especially in Data Science and ML, many of these tools target that audience.

One such newcomer is Bodo.ai, and I’ve seen them pop up on places like r/dataengineering. Fortunately, they have a free community edition, so let’s kick the tires and see what’s going on.

Read more

Every once in a while I see someone talking about their wonderful distributed cluster of Dask machines, and my curiosity gets aroused. I know plenty of people use Dask, mostly on their local machines, but it seems that with the meteoric rise of Spark, especially with tools like EMR and Databricks, Dask is slowly slipping into the shadows. I’ve had bad experiences with Dask in the past, trying to get it to work well in production. I suppose that comes from working with tried-and-true Spark and other bulletproof distributed systems. I’ve been meaning to return to Dask for a while and compare similar Dask and Spark clusters on performance … and other things like ease of setup and writing code. Let’s get to it.

Read more

Sometimes I amaze myself. I’ve been using PySpark for a few years now, happily crunching hundreds of TBs of data without much problem. Sure, you randomly run into OOM errors and other such nonsense. Usually inspecting the code for something silly and throwing in a persist() or cache() here and there will solve 99% of the problems. I’ve always taken an overly pragmatic approach to Spark performance. Spark being the beast that it is, it’s easy to hide performance problems with more resources. I’ve generally tried to stay away from UDFs, relying instead on good coding practices and out-of-the-box functionality. Ensuring good predicate pushdown, data partitioning, and so on is helpful and important. But in the end … I don’t really know much about the out-of-the-box Spark configurations and how they affect performance.

Do the configurations change based on data size and partitioning strategy, plus resources and cluster size? Probably. Does that seem complicated to figure out? Yes. Is the internet full of conflicting, vague, and confusing advice? Of course.

Read more