Anyone who’s been roaming around the forest of Data Engineering has probably run into many of the newish tools that have been growing rapidly around the concepts of Data Warehouses, Data Lakes, and Lakehouses … the merging of old relational database functionality with TB- and PB-level cloud-based file storage systems. Tools like Delta Lake, lakeFS, Hudi, and their kin.

Sure, these tools have been around for some time, but uptake and adoption have been growing rapidly. I use Delta Lake on a daily basis, taking advantage of the many wonderful features it provides to simplify and reduce complexity in data pipelines. But I’ve been sitting around for a long time waiting for the plethora of “add-on” tooling to come out, stuff that will make my life easier. I recently saw one of the first such tools for Delta Lake, namely mack.

Mack appears to have the ability to “do the hard work for you,” a concept that appears to be growing in popularity, but which I have a fraught relationship with. Double-edged sword? Let’s find out.
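
To give a flavor of what “doing the hard work for you” means here, below is a minimal sketch of the kind of one-liner mack advertises, deduplicating a Delta table. The function name and signature are from my quick read of the mack docs and may differ in your version, and the table path is made up.

```python
# Minimal sketch: deduplicating a Delta table with mack, assuming a local
# Spark session with Delta Lake configured. kill_duplicates and its signature
# reflect my reading of the mack docs and may have shifted since.
import mack
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("mack-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

orders = DeltaTable.forPath(spark, "/tmp/delta/orders")  # made-up path

# One call instead of the usual window-function-plus-MERGE dance for
# removing rows that share the same business key.
mack.kill_duplicates(orders, ["order_id", "order_date"])
```

If that holds up in practice, it’s exactly the kind of boilerplate I’d rather not keep writing by hand.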

Read more

Photo by Francesco Alberti on Unsplash

Nothing captures the imagination and heart like a tale of betrayal and heartbreak, and that is a tale I want to bring to you today. It’s a tale of Databricks Workflows and Jobs, version changes, new features, APIs, and insidious little hidden gems that will make you pull your hair out when you find them. It’s a tale of what not to do, a tale of how to put developer and customer experience first, instead of forcing unwanted solutions down the throats of the little birdies feeding at your nest.

As a Data Engineer, simplicity and ease of use are close to my heart, something that Databricks did well, or maybe I should say used to do well … before recent releases like the Jobs 2.1 API. I hope you can hear the bitterness oozing from my words.
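
For context, here’s a rough sketch of what creating a job through the Jobs 2.1 endpoint looks like with plain requests. The workspace host, token, cluster id, and notebook path are placeholders; the notable 2.1 change is that a job is now a list of tasks, each with its own task_key, rather than the single flat task of 2.0.

```python
# Rough sketch: creating a job via the Databricks Jobs 2.1 REST API.
# Host, token, cluster id, and notebook path are placeholders.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "dapiXXXXXXXX"  # placeholder personal access token

payload = {
    "name": "nightly_ingest",
    "tasks": [  # 2.1 jobs are a list of tasks, even if you only have one
        {
            "task_key": "ingest",
            "existing_cluster_id": "1234-567890-abcde123",  # placeholder
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id on success
```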

Read more

There are probably few things in life that will strike more fear and tumult in the heart of the Data Engineer than historical loads. You know, on the surface it seems like such an innocent thing. How hard could it possibly be? Just take a bunch of data stored somewhere and shove it into a table. If only. Life never works that way, and neither does the historical load. You would think after decades we all would have figured it out by now, you know. Is it because we don’t do it enough? Maybe it’s like regex, you just figure it out as you go every single time, telling yourself you’ll do it right next time.

Read more
Photo by krakenimages on Unsplash

The intersection of Big Data and Not Big Data.

An interesting topic of late that has been rattling around in my overcrowded head is the idea of Big Data vs Not Big Data, and the intersection thereof. I’ve been thinking about SaaS vendors, the Modern Data Stack, costs, and innovation. A great real-life example of all these topics is Delta Lake. Delta Lake is the child of Databricks, officially or not, and at a minimum has exploded in usage along with the rise of Databricks and the popularity of Data Lakes.

Delta Lake, Hudi, Iceberg, all these ACID/CRUD abstractions on top of storage for Big Data have been game changers. But, as with any new popular tech, they come with their own set of challenges. Specifically for Delta Lake … if you want to use it, 99.9% of the time you’re going to have to use Spark to do so, which can be costly, in terms of running clusters, and add complexity, in terms of new tooling, data pipelines, and the like. Anytime you only have one path to take with a tool, innovation is stifled, and barriers arise. Enter delta-rs, the standalone Rust API for Delta Lake.
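
To show what the other path looks like, here’s a small sketch using the Python bindings of delta-rs (the deltalake package) to write and read a Delta table with no Spark cluster in sight. The path and data are made up; for S3 you’d also pass storage_options with credentials.

```python
# Small sketch: reading and writing a Delta table with delta-rs, no JVM needed.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 3.75]})

# Append a small batch to a Delta table on local disk (made-up path).
write_deltalake("/tmp/delta/orders", df, mode="append")

# Read it back and peek at the transaction log.
dt = DeltaTable("/tmp/delta/orders")
print(dt.to_pandas())
print(dt.history())
```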

Read more
Photo by Aziz Acharki on Unsplash

Do this, not that. Well, I’ve got my own list. With everyone jumping on the PySpark / Databricks / EMR / Glue / Whatever bandwagon, I thought it was long overdue for a post on what to do, and not to do, when working with Spark / PySpark. I take the pragmatic approach to working with Spark; it’s honestly very forgiving well into the tens of TBs of data. Once you wander past that point, things tend to get a little spicy if you don’t have it all dialed in. As with most things in life, if you get a few things right, and of course don’t do some things, that will get you a long way, and the same applies to Spark.
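
As a taste, here is one illustrative do-this-not-that pair, not the post’s full list: prefer Spark’s built-in functions over Python UDFs, so the work stays in the JVM instead of round-tripping through Python row by row.

```python
# Illustrative do-this-not-that: built-in functions vs a Python UDF.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("spark-dos-and-donts").getOrCreate()
df = spark.createDataFrame([("jane",), ("joe",)], ["name"])

# Don't: a Python UDF for something Spark already does natively.
shout = F.udf(lambda s: s.upper(), StringType())
dont = df.withColumn("name_upper", shout("name"))

# Do: the built-in column function, which the optimizer can see through.
do = df.withColumn("name_upper", F.upper(F.col("name")))
do.show()
```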

Read more
Photo by Josh Rakower on Unsplash

So, you’ve heard about dbt, have you? I honestly can’t decide if it’s here to stay or not. It probably is; enough folks are using it and preaching about it. I personally have always been a little skeptical of dbt, not because it can’t do what it says it can do, it can, but because I’m old and bitter from my many years of Data Engineering, and I always see the problems in things.

But, I will let you judge that for yourself. Today I want to give a brief overview of dbt, kick the tires, muse about its features, and most importantly, look at dbt from a Data Engineering perspective, ferret out the good, the bad, and the ugly. I will try my best to be nice but don’t count on it. Code is on GitHub.

Read more
Photo by Tim Schmidbauer on Unsplash

Ever since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools: not enough options, hard to use, and the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…

Data reliability testing for SQL- and Spark-accessible data.

soda-core docs

Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics we hear a lot about these days. The reality often doesn’t meet the expectations. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core paired with Spark can show us some real promise. Code available on GitHub.
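
Here’s a rough sketch of what pointing soda-core at a Spark DataFrame looks like; the method names follow the soda-core-spark-df docs as I read them, and the table name and checks are made up for illustration.

```python
# Rough sketch: running SodaCL checks against a Spark DataFrame with soda-core.
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("soda-demo").getOrCreate()
orders = spark.createDataFrame(
    [(1, "2023-01-01"), (2, None)], ["order_id", "order_date"]
)
orders.createOrReplaceTempView("orders")  # checks reference this view name

scan = Scan()
scan.set_scan_definition_name("orders_quality")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")
scan.add_sodacl_yaml_str(
    """
checks for orders:
  - row_count > 0
  - missing_count(order_date) = 0
"""
)
scan.execute()
print(scan.get_logs_text())  # the second check should flag the missing date
```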

Read more
Photo by Benjamin Wedemeyer on Unsplash

I think it’s funny that DataFrames are so popular these days, I mean for good reason. They are a wonderful and intuitive way to work with and on datasets. Pandas … the nemesis of all Data Engineers and the lover of Data Scientists. Apache Spark is really the beast that brought DataFrames to the masses. Even those little buggers over at Apache Beam give you DataFrames.

Of course, when anything gets popular, little things start to pick and peck at its heels. I would probably say that is what DataFusion with Rust seems to be. It seems more like a contender against Pandas than Spark to me. I guess if you’re just using Spark locally or on a single node, sure, you could consider using DataFusion. Code available on GitHub.
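
For a quick taste, here’s a minimal sketch of querying a local file with the DataFusion Python bindings (pip install datafusion); class and method names are from the version I poked at, and the file path is a placeholder.

```python
# Minimal sketch: SQL over a local CSV with DataFusion's Python bindings.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("trips", "/tmp/data/trips.csv")  # placeholder path

# Runs on the Rust engine locally, no JVM or cluster, and hands back
# Arrow record batches.
batches = ctx.sql(
    "SELECT passenger_count, count(*) AS rides FROM trips GROUP BY passenger_count"
).collect()

for batch in batches:
    print(batch.to_pydict())
```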

Read more
Photo by Saad Chaudhry on Unsplash

How many times in your life, that is but a mist, have you thought, “If I had only known that in the beginning?” I feel as if I’ve committed that cardinal sin as a developer and Data Engineer … falling in love with a tool to the exclusion of all else. I mean truly, Databricks has brought Big Data to the masses; all you need is your laptop and 10 minutes of PySpark training before you’re spending gobs of money, processing massive amounts of data. Where else, and with what else, can you do such things? Try it with EMR, good luck to you.

That being said, when you love something you start to notice the slight imperfections and problems with that something. You get kinda nit-picky. Such is life. I want to save some poor soul out there some heartache, that moment when you’ve been writing code for hours or days, and come upon a little surprise that makes your heart drop into your shoes, and the blood runs to your face. Here are 10 things I wish I knew about Databricks before I started. Maybe it will save you time, help you, who knows.

Read more
Photo by Joel Holland on Unsplash

I think Delta Lake is here to stay. With the recent news that Databricks is open-sourcing the full feature-set of Delta Lake, instead of keeping the best stuff for themselves, it probably has the most potential to be the number one go-to for the future of Data Lakes, especially within those organizations that are heavy Spark users.

One of the best parts about Delta Lake is that it’s easy to use, yet it has a rich feature set, making it a powerful option for Big Data storage and modeling. One of those features that promise a lot of performance benefits is something called ZORDER. Today I want to explore more in-depth what ZORDER is, when to use it, when not to use it, and most importantly test its performance during a number of common Spark operations.
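
As a preview, here’s a small sketch of applying ZORDER to an existing Delta table through Spark SQL; the table path and column are placeholders, and the clustering only pays off when you routinely filter on those columns.

```python
# Small sketch: OPTIMIZE + ZORDER on an existing Delta table via Spark SQL.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("zorder-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Compact small files and co-locate rows with similar customer_id values,
# so filters on customer_id can skip whole files (placeholder path/column).
spark.sql("OPTIMIZE delta.`/tmp/delta/orders` ZORDER BY (customer_id)")
```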

Read more