We’ve all been in that spot, especially in tech. You wanted to fit in, be cool, and look smart, so you didn’t ask any questions. And now it’s too late. You’re stuck. Now you simply can’t ask … you’re too afraid. I get it. Apache Arrow is probably one of those things. It keeps popping up here and there and everywhere.

The only reason I know anything about Arrow is that some years ago, circa 2019 and earlier, I stumbled into Arrow and used it (PyArrow, that is) to read and write Parquet files. Heck, I even used it to tie together Python and Hadoop; Lord knows what I was thinking back then. I’m amazed at how much I used PyArrow back in the day, even to compare Parquet vs Avro.

“Back then it seemed like no one used Arrow much; no one was writing about it or talking about it. At least not that I saw. But oh how times have changed. Arrow seems to be showing up everywhere and is starting to become a backbone for many other tools.”

– me
Read more

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My, my, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean, it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy? Or maybe agg is what you need? No? Can you say distributed data set? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But it does set the stage nicely for a clash of the titans, per se.

Let’s do just that. A straight out-of-the-box performance test. A bunch of CSVs, a little aggregation, just some simple stuff. Mirror, mirror on the wall, who is the fastest with DataFrames of them all?

Read more
Photo by Aziz Acharki on Unsplash

Do this, not that. Well, I’ve got my own list. With everyone jumping on the PySpark / Databricks / EMR / Glue / Whatever bandwagon, I thought it was long overdue for a post on what to do, and what not to do, when working with Spark / PySpark. I take a pragmatic approach to working with Spark; it’s honestly very forgiving well into the tens of TBs of data. Once you wander past that point, things tend to get a little spicy if you don’t have it all dialed in. As with most things in life, getting a few things right, and of course avoiding a few others, will get you a long way. The same applies to Spark.

Read more
Photo by Tim Schmidbauer on Unsplash

Ever since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools: there aren’t enough options, they are hard to use, and they come with the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…

Data reliability testing for SQL- and Spark-accessible data.

soda-core docs

Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics we hear a lot about these days. The reality often doesn’t meet the expectations. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core paired with Spark can show us some real promise. Code available on GitHub.

Read more
Photo by José Ramos on Unsplash

I’ve always been a firm believer in using the right tool for the job. Sometimes I look at a piece of code … and ask … why? I mean, just because you can do something doesn’t mean that you should. I see a lot of my job, as someone who writes code, as not just the ability to write code, but the ability to reason about problems and design simple and elegant solutions to the problem at hand.

I try not to let my love of a tool, language, or package color my view of the world as it is. In fact, there is wisdom to be found in being critical of those languages and tools you love the most. Be aware of their shortcomings and failures. This leads to better software and architecture designs, and less complexity. Too often I’ve seen folks picking their tool of choice and then sticking with it till the bitter end, and it usually is bitter. There is more to life than writing obtuse Scala code that is illegible for some mundane task.

This sort of thing is a blight on everyone and every system. Now I must descend from my high horse and join the peasants on the dusty road of life. Today I want to look at a very common Data Engineering task, namely working with cloud storage, and what it is like to do such a thing with Golang, Rust, and Python. I will let you draw your own conclusions. Maybe. Code available on GitHub.

Read more
Photo by davisuko on Unsplash

Just when I think it cannot get more popular, it does. I have to admit, PySpark is probably the best thing that ever happened to Big Data. It made what was once a myth approachable to the average person. No need for esoteric Java skills, no more MapReduce, just plain old Python. Another amazing thing about Spark in general, and by extension PySpark, is the sheer amount of out-of-the-box capabilities. I wanted to dedicate this post to a few amazing and wonderful features of PySpark that make Data Engineering fun and powerful.

Read more
Photo by Joshua Sortino on Unsplash

It still seems like the wild west of Data Quality these days. Tools like Deequ are just too much for most folks, and Data Quality is still new enough to the scene as a serious thought topic that most tools haven’t matured that much, and companies dropping money on some tool is still a little suspect. I’ve probably heard more about Great Expectations as a DQ tool than about most others.

With the popularity of PySpark as a Big Data tool, and Great Expectations coming into its own, I’ve been meaning to dive into what it would actually look like to use Great Expectations at scale and answer some simple questions. How easy is it to get up and running with Spark? What’s the path of least resistance to getting some basic Data Quality checks in place in a data pipeline?

Read more

Probably one of the hardest hurdles to jump over when starting out in anything new, including Data Engineering and Data Pipelines, is knowing where to start. It can always be a little daunting. One aspect that can make or break any project, giving you the confidence to move forward like Spartacus to conquer, is having a good project template for your repository of code and logic that will encapsulate and present your code to others.

I’ve created a free and hopefully helpful Python blank GitHub project template that you can clone, change, and steal to your heart’s desire. I hope it will be helpful and set you going in the right direction for your next project.

Read more

Not going to lie, I do enjoy the vendor wars that this marketing craze called “The Modern Data Stack” has created. I like to keep just about everything in life at arm’s length. Kinda like the way you look at your crazy third cousin out of the corner of your eye at the family reunion. I mean it’s nice to have all these options to choose from these days when building data pipelines.

One tool I haven’t been able to kick the tires on yet is Prefect. It appears to be another data orchestration tool for Python, but we shall find out. I want this to be an introduction to Prefect; we shall just try it out and let the chips fall where they may.

Read more

As the years drag by in Data Engineering, there are a few things that I have come to appreciate more and more. One of those topics that is close to number one on the list is complexity reduction. Today’s modern data stacks are filled to the brim, and overflowing, with technologies and tools. With so many tools and such wonderful features, all the magic sometimes comes with a downside. Complexity. Complexity can turn something wonderful into a nightmare.

Reducing (not avoiding) complexity seems to be one of the main tenets I work on these days when designing resilient, reliable, and repeatable data pipelines that can process terabytes of data. One of the tools that helps with this is the COPY INTO feature of Databricks + Delta Lake.

Read more