There are few things in life worse than cracking open some serious PySpark pipeline code and realizing there isn’t a single function written to encapsulate logic … wondering if some change you are about to make will bring down the whole pipeline. When you are new to a codebase you don’t know […]
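As a rough sketch of what I mean by encapsulating logic, here is what a couple of small, named transformation functions might look like. The column names, toy data, and function names below are made up purely for illustration:

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

def add_order_date(df: DataFrame) -> DataFrame:
    """Derive a proper date column from the raw timestamp string."""
    return df.withColumn("order_date", F.to_date("order_ts"))

def filter_valid_amounts(df: DataFrame) -> DataFrame:
    """Drop rows with null or non-positive amounts."""
    return df.where(F.col("amount").isNotNull() & (F.col("amount") > 0))

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy input; in real life this would be a read from storage.
raw = spark.createDataFrame(
    [("2021-05-01 10:15:00", 42.50), ("2021-05-02 11:00:00", -1.0)],
    ["order_ts", "amount"],
)

# Each step is a small, named function that can be unit tested in isolation.
clean = raw.transform(add_order_date).transform(filter_valid_amounts)
clean.show()
```

Every step becomes something you can test and reason about on its own, instead of one giant chain of withColumn calls you’re afraid to touch.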

It seems like the problems and challenges of Data Engineering are being solved at a lightning pace these days. New technologies are coming out all the time that seem to make life a little easier (or harder) while solving age-old problems. I feel like Machine Learning Ops (MLOps) is not one of those things. It’s […]

If you’re anything like me, when someone says Delta Lake you think DataBricks. But the mythical Delta Lake is an open source project, available to anyone running Apache Spark. It also seems too good to be true, ACID transactions at Spark scale? Incredible. This is the future, it has to be. The lines of […]
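For a taste of what the open source side looks like, here is a minimal sketch of writing and mutating a Delta table on plain Apache Spark, assuming the delta-spark package is installed; the path and columns are made up:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta Lake plugs into open source Spark via a session extension and catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(5).withColumnRenamed("id", "order_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Deletes and updates are ACID transactions, even with concurrent readers.
orders = DeltaTable.forPath(spark, "/tmp/delta/orders")
orders.delete("order_id = 3")
```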

Some poor Data Engineer is sweating and typing away in a dark closet … moving data, solving bugs, just trying to get through the day. Why should the ol’ Data Engineer care about the huff-a-luff around the billion-dollar series recently done by DataBricks? I mean, what possible relevance could it have on the day […]

I am going to peer into the crystal ball, the seeing stone, looking into the murky future of Data Engineering to see what mysteries it holds. I’ve seen a story, a tale of two Data Warehouses; I’ve seen Machine Learning, Streams, Distributed Systems, Storage, the eternal SQL. A lot has changed in the world of Data […]

In Part 1 of the big data file formats series we reviewed Parquet vs Avro. It was apparent from the start that the two file formats were built for different things. Avro is clearly a complex row-structured file format used in communication and transactions, where schema is king and nested structures are no problem. Parquet […]
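To make the row-versus-column difference concrete, here’s a small sketch that writes the same made-up record with the fastavro and pyarrow libraries; the schema and data are purely illustrative:

```python
# Avro is schema-first and row oriented; Parquet is columnar.
from fastavro import parse_schema, writer
import pyarrow as pa
import pyarrow.parquet as pq

schema = parse_schema({
    "name": "Order",
    "type": "record",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "customer", "type": {          # nested record, no problem in Avro
            "name": "Customer", "type": "record",
            "fields": [{"name": "name", "type": "string"}],
        }},
    ],
})
records = [{"order_id": 1, "customer": {"name": "Sam"}}]

with open("orders.avro", "wb") as out:
    writer(out, schema, records)                # rows written record by record

# The same data as Parquet: stored column by column for fast analytical scans.
table = pa.table({"order_id": [1], "customer_name": ["Sam"]})
pq.write_table(table, "orders.parquet")
```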

Don’t you like stuff for free? Don’t you like it when stuff is just handed to you? I mean, when was the last time you didn’t want to get a free t-shirt? How about 20 bucks in the mail from your Grandma? That’s kinda what Pipelines are in Spark ML. The Apache Spark ML library […]
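Here’s a minimal sketch of that free stuff, chaining a couple of feature stages and a model the way the Spark ML docs do; the toy data and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop is old", 0.0)],
    ["text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# The Pipeline hands each stage's output to the next; fit() returns one model
# that replays the whole chain at prediction time.
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
predictions = model.transform(train)
```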

With Parquet taking over the big data world, as it should, and CSV files being that third wheel that just will never go away … it’s becoming more and more common to see the repetitive task of converting CSV files into Parquet. There are lots of reasons to do this: compression, fast reads, integrations with tools […]
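The conversion itself is usually only a few lines in PySpark. A minimal sketch, assuming the files have a header row and using hypothetical S3 paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every CSV in the raw prefix and rewrite it as Parquet.
(
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/raw/*.csv")
    .write
    .mode("overwrite")
    .parquet("s3://my-bucket/converted/")
)
```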

I’ve always been surprised, with the rise of data engineering and big data, at how hard it is to find good data engineering content on a somewhat regular basis. Tech moves fast, and I feel like data engineering moves even faster. New tools and systems come out with regular frequency, and it’s hard to keep […]

I’ve always been surprised at how little of the Python code I’ve seen uses the map() and filter() built-ins as standalone functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild; I’ve even been asked to remove them from my MRs/PRs, for no other […]
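For reference, here is a tiny sketch of map() and filter() used as standalone built-ins, next to the comprehension most reviewers seem to prefer; the price data is made up:

```python
prices = [9.99, 0.0, 25.50, -3.0, 12.00]

# Keep only positive prices, then apply a 10% discount.
discounted = list(map(lambda p: round(p * 0.9, 2), filter(lambda p: p > 0, prices)))

# The equivalent list comprehension.
discounted_comp = [round(p * 0.9, 2) for p in prices if p > 0]

assert discounted == discounted_comp
```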