Photo by Jason Leung on Unsplash

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that data.

Read more

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My My, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy, maybe agg is what you need? No? Can you say distributed data set? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But, it does set the stage nicely for a clash of the titans per see.

Let’s do this just that. Straight out of the box performance test. Bunch of CSV’s, a little aggregation, just some simple stuff. Mirror mirror on the wall, who is the fastest with DataFrames of them all?

Read more
Photo by Ray Hennessy on Unsplash

I’ve often wondered what purgatory would be like, doing penance for millennia into eternity. It would probably be doing data migrations. I suppose they are not all that dissimilar from normal software migrations, but there are a few things that make data migrations a little more horrible and soul-sucking. Data migrations are able to slow teams down to a crawl, take at least twice as long as planned, and be way more difficult than imagined.

Can’t it be made easy, shouldn’t Data Migrations have been conquered by now? I mean just put together the perfect plan, break up the work, make a bunch of tickets, estimate the work, and the rest falls into place? If only.

Read more
Photo by Francesco Alberti on Unsplash

Nothing captures the imagination and heart like a tale of betrayal and heartbreak, and that is a tale I want to bring to you today. It’s a tale of Databricks Workflows and Jobs, version changes, new features, API’s, and insidious little hidden gems that will make you pull your hair out when you find them. It’s a tale of what not to do, a tale of how to put developer and customer experience first, instead of forcing unwanted solutions down the throats of the little birdies feeding at your nest.

As a Data Engineering simplicity and ease of use is something close to my heart, something that Databricks did well, or maybe I should say used to do well … before recent releases like Jobs 2.1 API. I hope you can hear the bitterness oozing from my words.

Read more

Ok, so I don’t really mean all that. Or do I? I have no idea what the future holds. Sometimes it’s easy to pick out the winners, like Databricks and Snowflake, you can see, feel, and taste the results of those data products, a delicious and delectable bounty to feast upon. Other things are harder to read the tea leaves on. Kinda like Data Mesh … is it a thing, or is it not a thing? It’s hard to decern between charlatans and marketing/sales departments hocking the next Cure All Snake Oil and real life.

What about all this recent humdrum and buzz around Data Contracts? Pushed by some popular Data Engineering faces like Ananth Packkildurai and Chad Sanderson. What is all the hype about Data Contracts, are folks just pushing another tool down our throats? Is there a real issue and problem that can be solved with Data Contracts?

Read more
Photo by Mitchel Boot on Unsplash

Sometimes I feel like I’ve been doing this too long, life gets busy, and I don’t have much to say … but here I am 5 years later. I’m still making people mad and making a fool of myself, some things never change. This will probably be short and sweet. I will cover the top 10 most popular blog posts from those 5 years, what the traffic has looked like over time, and what I’ve learned from writing blogs for so long, the good, the bad, and the ugly.

Read more
Photo by Kevin Ku on Unsplash

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz has quieted down over the last few years about ML and MLOps, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why this is the case and comparing my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees, and with a handful of employees. The problems and hurdles at the same across the board, and mostly everyone is not very good at it.

Read more
Photo by Alain Pham on Unsplash

Best practices are always a touchy subject, I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, keeping things simple. I also ascribe to the 80/20 rules, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well, in the long run will probably solve most of your major problems encountered in data teams and architectures. Today I want to give you 8 Data Engineering best practices to hopefully give you some food for thought at least.

Read more
Photo by Dan Lohmar on Unsplash

In the beginning, I always thought the humdrum Big O Notation discussions should be reserved for Software Engineers who enjoyed working on such things. I mean, what could it possibly have to do with Data Engineering? I mean, if you are the person writing the Spark application, by all means, have at it, but if you are the Data Engineer who is simply using Spark, why can’t you just leave the details to the Devil? Seems to make sense.

The only problem with that logic is the longer you work as a Data Engineer, probably the harder the problems you work on become, you write more and more code, and basically end up being a specialized Software Engineer … even if you don’t want to be. In the end, to be a good Data Engineer you should at least attempt to understand the concepts behind Big O Notation, and how those concepts can apply to you as Data Engineer, especially for the ETL that most of us write.

Read more
Image: Saint Augustine of Hippo | Line engraving by P. Cool after M. de Vos | Wellcome Images

I’ve always enjoyed reading Mr. Augustine of Hippo, particularly “Confessions.” Ahead of his time in many ways. Although, you have to be into that sort of thing to find such topics interesting. It can be sort of dry, drawn out, verbose, and not for the faint of heart. Much like learning new programming languages. I’ve been messing with Golang off and on and here and there. Recently I added Rust to that list, more out of curiosity and to see what’s new in the world.

I’ve spent a lot of time thinking about the theology of programming in the space of Data Engineering. It’s such a wide area that encompasses so many different skills, Data Engineering that is. Why do we do what we do, write what we write? Like Augustine I see both old and new all around me, some things change, but many things stay the same.

People find hills like Python, Scala, Golang, Rust, and then promptly decide to die on them. I enjoy different things simply because of the way they teach you things about yourself and the world.

Read more