Do you think I’m just trying to get you to click? Maybe. Maybe not. After working in and around Data Teams for well over a decade, with both the smartest people to touch the keyboard, and the others, it’s become quite clear to me what the number one skill is that separates a Senior-level Engineer from the peons rummaging around in the StackOverflow garbage can for snippets.

I’m sure there will be hand-wringing, curses, tears, and general weeping and moaning in the land, as if some medieval plague had swept away everything we hold dear. So just calm yourselves, sit down, and get your angry little fingers off that keyboard. Hear me out.

Read more

Nothing gives me greater joy than rocking the boat. I take pleasure in finding what people love most in tech and trying to poke holes in it. Everything is sacred. Nothing is sacred. I also enjoy doing simple things, things that have a “real-life” feel to them. I suppose I could be like the others and simply write boring tutorials on how to do the same old thing for the millionth time.

Ugh. No thanks.

Today I want to do something spectacularly normal. Something Data Engineers do. I’m simply going to write two AWS Lambdas to process some data, one with Polars and one with Pandas. What do I hope to accomplish?

Well, I can usually make a few people mad. AWS Architects and fan clubs, Polars people, Pandas people, and the general public at large. Bring it.

All code on GitHub.

Read more

Sometimes it seems like the Data Engineering landscape is starting to shoot off into infinity. With the rise of Rust, new tools like DuckDB, Polars, and whatever else, things do seem to be shifting at a fundamental level. It seems like there is someone at the base of a teetering rock with a crowbar, picking and prying away, determined to topple tools like Java, Scala, Python, Spark, and Airflow, the things we’ve known and loved for years, from their lofty thrones.

Maybe they have all had their time in the Data Engineering sun; maybe it’s time to shake things up. It seems to be happening. It’s always hard to have those we hold dear be poked and prodded at. I’ve been using Spark since before it was cool, so when I started to hear the word Ballista show up here and there, I took note.

Besides, I’ve been dabbling my grubby little fingers in Rust for some months now, and have seen The Light. Is it possible I could be living at the dawn of a new era? A new and exciting frontier of Data Engineering, finally, after all this time? Could Rust really take over? Will something like Ballista pull that old Spark from its distributed processing tower and claim its rightful place?

Read more

I’ve been a dog licking my wounds for some time now. Over on my Substack newsletter, I’ve been doing a small series on DSA (Data Structures and Algorithms). I tackled some of the easier stuff first, like Linked Lists, Binary Search, and the like. What’s more, I actually did most of it in Rust, since I’ve possibly, maybe slightly, ever so slightly, fallen in love with Rust.

Like most relationships, it vacillates between pure adoration and utter hatred, depending on the problem at hand. When I did a recent article on Graphs, Queues, and BFS, I attempted it in Rust and was struck a mighty blow. That borrow checker had me down. It seemed doable, but at the time, under pressure to get the Newsletter out, I reverted to Python and moved on.

Alas, I’m back again, a glutton for punishment. This time I thought I should take another crack at parsing a graph with Rust, but in a real-life situation, no more made-up stuff. Actual data, actual graph, here we go. All code is on GitHub.

Read more

Sometimes I think Data Engineering is the same as it was 10+ years ago when I started doing it, and sometimes I think everything has changed. It’s probably both. In some ways, the underlying concepts have not moved an inch; certain truths and axioms still rule over us all like some distant landlord, requiring us to pay the piper at a moment’s notice. Still, with all those things that haven’t changed, the size, velocity, and types of data have exploded. Data sources have run wild, along with multiple cloud providers and a plethora of tooling.

So yes, maybe in a lot of ways Data Engineering has changed, or at least how we do things is a new and wild frontier, with beasts around every corner waiting to devour us in our ignorance. Never mind the wild groups of zealots roaming around seeking converts to their cause and spitting on those unwilling to bend.

Probably like many of you, I’ve had a healthy skepticism of all things new, at least until they have proved themselves out over some time. This is both a good and a bad habit. It can protect you from undue harm and foolishness, but it can also mean a lost opportunity when you pass over the diamond in the rough. I, for one, think that if something is worth its salt, it is usually clear, and its obvious value can be discerned quite readily.

Read more

One of my greatest pleasures in life is watching the r/dataengineering Reddit board. I find it very entertaining and enlightening on many levels. It gives a fairly unique view into the wide range of Data Engineering companies, jobs, projects people are working on, tech stacks, and problems that are being faced.

One thing I’ve come to realize over the years, working on many different Data Teams, and backed up by a casual observation of discussions on Reddit and other places, is that despite us living in the age of ChatGPT, Data Engineering teams generally seem to lag far behind in most areas of the Development Lifecycle.

So, to fix all the problems in the entire world and save humanity and Data Engineers from themselves, I give you the gift of telling you how to do your job. You’re welcome.

Read more

Polars is one of those tools that you just want … no … NEED a reason to use. It’s gotten so bad, I’ve started to use it, Polars that is, in my Rust code on the side. I mean, you have a problem if you could use Polars Python, and you find yourself using Polars Rust. Glutton for punishment, I guess.

I also recently took personal offense when someone at a birthday party told me that everyone uses Pandas, and no one uses Polars in the real world. Dang. That hurt.

The reality is that I know it takes a long while for even the best technologies to be adopted. Things don’t just change overnight. But there are two hidden gems of Polars that will hasten the day when Polars replaces Pandas for good. Let’s talk about them.

Read more