Ramblings Archives - Page 3 of 6 - Confessions of a Data Guy

I asked ChatGPT to write a blog post about Data Engineering. Here it is.

Data engineering is a vital field within the realm of data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that data.

December 30, 2022

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My My, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy, or maybe agg is what you need? No? Can you say distributed data set? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But it does set the stage nicely for a clash of the titans, per se.

Let’s do just that. A straight out-of-the-box performance test. A bunch of CSVs, a little aggregation, just some simple stuff. Mirror mirror on the wall, who is the fastest with DataFrames of them all?

December 10, 2022

Big Data, Data, Data Engineering, Ramblings

Why Data Migrations Suck.

I’ve often wondered what purgatory would be like, doing penance for millennia into eternity. It would probably be doing data migrations. I suppose they are not all that dissimilar from normal software migrations, but there are a few things that make data migrations a little more horrible and soul-sucking. Data migrations are able to slow teams down to a crawl, take at least twice as long as planned, and be way more difficult than imagined.

Can’t it be made easy, shouldn’t Data Migrations have been conquered by now? I mean just put together the perfect plan, break up the work, make a bunch of tickets, estimate the work, and the rest falls into place? If only.

December 5, 2022

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

A Tale of Betrayal and Heartbreak – Databricks Workflows and Jobs.

Nothing captures the imagination and heart like a tale of betrayal and heartbreak, and that is a tale I want to bring to you today. It’s a tale of Databricks Workflows and Jobs, version changes, new features, APIs, and insidious little hidden gems that will make you pull your hair out when you find them. It’s a tale of what not to do, a tale of how to put developer and customer experience first, instead of forcing unwanted solutions down the throats of the little birdies feeding at your nest.

As a Data Engineer, simplicity and ease of use are close to my heart, something that Databricks did well, or maybe I should say used to do well … before recent releases like the Jobs 2.1 API. I hope you can hear the bitterness oozing from my words.

December 1, 2022

Data, Data Engineering, Data Quality, Ramblings

A Diatribe against Data Contracts and their Abuses.

Ok, so I don’t really mean all that. Or do I? I have no idea what the future holds. Sometimes it’s easy to pick out the winners, like Databricks and Snowflake. You can see, feel, and taste the results of those data products, a delicious and delectable bounty to feast upon. Other things are harder to read the tea leaves on. Kinda like Data Mesh … is it a thing, or is it not a thing? It’s hard to discern between real life and the charlatans and marketing/sales departments hawking the next Cure All Snake Oil.

What about all this recent humdrum and buzz around Data Contracts, pushed by some popular Data Engineering faces like Ananth Packkildurai and Chad Sanderson? What is all the hype about Data Contracts? Are folks just pushing another tool down our throats? Is there a real issue and problem that can be solved with Data Contracts?

November 16, 2022

Data, Data Engineering, Ramblings

5 Years of Blogging – Most Popular Articles, Traffic Stats, and other Thoughts.

Sometimes I feel like I’ve been doing this too long, life gets busy, and I don’t have much to say … but here I am 5 years later. I’m still making people mad and making a fool of myself, some things never change. This will probably be short and sweet. I will cover the top 10 most popular blog posts from those 5 years, what the traffic has looked like over time, and what I’ve learned from writing blogs for so long, the good, the bad, and the ugly.

October 7, 2022

Big Data, Data, Data Engineering, Machine Learning, Ramblings

Machine Learning from the viewpoint of an average Data Engineer.

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz about ML and MLOps has quieted down over the last few years, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why this is the case and comparing it with my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees, and with a handful of employees. The problems and hurdles are the same across the board, and mostly everyone is not very good at it.

September 26, 2022

Data, Data Engineering, Ramblings

8 Data Engineering Best Practices

Best practices are always a touchy subject. I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, keeping things simple. I also subscribe to the 80/20 rule, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well will, in the long run, probably solve most of the major problems encountered in data teams and architectures. Today I want to give you 8 Data Engineering best practices to hopefully give you some food for thought at least.

September 21, 2022

Big Data, Data, Data Engineering, Ramblings

Real-Life Example of Big O(n) Notation (and other such nonsense) for Data Engineering.

In the beginning, I always thought the humdrum Big O Notation discussions should be reserved for Software Engineers who enjoyed working on such things. I mean, what could it possibly have to do with Data Engineering? I mean, if you are the person writing the Spark application, by all means, have at it, but if you are the Data Engineer who is simply using Spark, why can’t you just leave the details to the Devil? Seems to make sense.

The only problem with that logic is that the longer you work as a Data Engineer, the harder the problems you work on probably become. You write more and more code, and basically end up being a specialized Software Engineer … even if you don’t want to be. In the end, to be a good Data Engineer you should at least attempt to understand the concepts behind Big O Notation, and how those concepts can apply to you as a Data Engineer, especially for the ETL that most of us write.
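To make that a little more concrete, here is a minimal sketch of how Big O can sneak into everyday ETL. It is not taken from the full post; the function names and the "find the new record IDs" scenario are hypothetical, picked only for illustration. Both functions return the same answer, but the first checks membership against a list (a linear scan per lookup, so roughly O(n * m) overall), while the second builds a set first (roughly O(n + m) on average).

```python
def find_new_ids_quadratic(incoming, existing):
    """Return incoming IDs not already loaded.

    `existing` is a list, so every `in` check scans it linearly:
    roughly O(len(incoming) * len(existing)) overall.
    """
    return [record_id for record_id in incoming if record_id not in existing]


def find_new_ids_linear(incoming, existing):
    """Same result, but build a set once so each lookup is O(1) on average:
    roughly O(len(incoming) + len(existing)) overall.
    """
    seen = set(existing)
    return [record_id for record_id in incoming if record_id not in seen]


if __name__ == "__main__":
    # Hypothetical IDs: even numbers are "already loaded".
    existing_ids = list(range(0, 20_000, 2))
    incoming_ids = list(range(0, 20_000))

    slow = find_new_ids_quadratic(incoming_ids, existing_ids)
    fast = find_new_ids_linear(incoming_ids, existing_ids)
    print(slow == fast, len(fast))
```

On a few thousand rows nobody notices the difference. On a few hundred million, the list version is probably the reason the pipeline never finishes.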

August 15, 2022

Data, Data Engineering, Golang, Ramblings, Rust

Thoughts on Saint Augustine, Rust vs Golang. Complexity, verbosity, and other matters.

**Image: *Saint Augustine of Hippo* | Line engraving by P. Cool after M. de Vos | Wellcome Images**

I’ve always enjoyed reading Mr. Augustine of Hippo, particularly “Confessions.” Ahead of his time in many ways. Although, you have to be into that sort of thing to find such topics interesting. It can be sort of dry, drawn out, verbose, and not for the faint of heart. Much like learning new programming languages. I’ve been messing with Golang off and on and here and there. Recently I added Rust to that list, more out of curiosity and to see what’s new in the world.

I’ve spent a lot of time thinking about the theology of programming in the space of Data Engineering. It’s such a wide area that encompasses so many different skills, Data Engineering that is. Why do we do what we do, write what we write? Like Augustine I see both old and new all around me, some things change, but many things stay the same.

People find hills like Python, Scala, Golang, Rust, and then promptly decide to die on them. I enjoy different things simply because of the way they teach you things about yourself and the world.

July 15, 2022
