Anyone who’s been working in Data Land for any time at all knows that the reality of life very rarely matches the glut of shiny snake oil we get sold on a daily basis. That’s just part of life. Every new tool, every single thingy-ma-bob we think is going to solve all our problems and send us happily into a state of nirvana inside our eternal data pipelines, is a lesson in disappointment.

I get it, there are a lot of nice tools out there. I use some of them every day. But, a healthy dose of reality is good for us all. Don’t lie to yourself. There is no such thing as the perfect tool. There are good tools, bad tools, and tools in between. The Truth is that all tools get pushed to their limits at some point.

We work on small teams, we don’t have all the time in the world, and we have to deliver our data at some point, perfect or not. We cut corners, hopefully the right ones. That’s part of being wise and putting years of data experience to work. Today I’m going to talk about my experience of running Databricks + Delta Lake at scale. What happens when you use Databricks to ingest and deal with tens of millions of records a day, billions-plus records a month?
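
For context, here’s a minimal sketch of the sort of pipeline I mean, with a hypothetical bucket and paths and a plain daily batch append onto a Delta table; the real thing has a lot more going on, this is just to frame the scale.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical bucket and paths, purely for illustration:
# one day's worth of raw JSON appended onto a Delta table.
incoming = spark.read.json("s3://my-bucket/raw/events/2023-06-01/")

(
    incoming
    .write
    .format("delta")
    .mode("append")
    .save("s3://my-bucket/delta/events")
)
```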

Read more

I was wondering the other day … since Polars now has a SQL context and is getting more popular by the day, do I need DuckDB anymore? These two tools are hot. Very hot. I haven’t seen this since Databricks and Snowflake first came out and started throwing mud at each other.
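 
To make the question concrete, here’s roughly what the Polars SQL side looks like; a minimal sketch with a made-up CSV file, not a benchmark.

```python
import polars as pl

# Hypothetical file name, just for illustration.
lazy_events = pl.scan_csv("events.csv")

# Register the LazyFrame under a table name and query it with SQL,
# the sort of thing DuckDB often gets reached for.
ctx = pl.SQLContext(events=lazy_events)
daily_counts = ctx.execute(
    "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date",
    eager=True,
)
print(daily_counts)
```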

You might think it doesn’t matter. Six of one, half a dozen of the other, whatever. But I think about these things. Simplicity is underrated these days. If you have two tools but could do it with one, should you use two? It probably depends on the Engineering culture you’re working in.

I mean, just because you can doesn’t mean you should. Some data engineering repo with 50 different Python pip packages installed, constantly breaking and upgrading for no reason. CI/CD builds failing, conflicts. Frustration. Why? Just because someone wanted to do this one thing and decided they needed yet another package to do it.

Read more

 

PySpark. One of those things to hate and love, well … kinda hard not to love. PySpark is the abstraction that lets a bazillion Data Engineers forget about that blight Scala and cuddle their wonderfully soft and ever-kind Python code, while choking down gobs of data like some Harkonnen glutton.

But that comes with a price: the price of our own laziness, the idea that all that glitters is gold, the pull of the easy path. One of the main problems is the dreadful mistake of mixing native Python in with your PySpark and expecting things to go fine at scale. Which they most assuredly will not.
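
A minimal sketch of the mistake I mean, on a toy DataFrame: the same logic written once as a plain Python UDF, where every row gets dragged through the Python interpreter, and once as built-in column expressions that stay inside Spark’s optimizer.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "order_id")

# The lazy path: a native Python UDF, serialized row by row
# between the JVM executors and the Python workers.
@F.udf("string")
def label_slow(order_id):
    return "big" if order_id > 5_000_000 else "small"

df_slow = df.withColumn("label", label_slow("order_id"))

# The same logic with built-in expressions, which Spark can optimize
# and run entirely on the JVM.
df_fast = df.withColumn(
    "label",
    F.when(F.col("order_id") > 5_000_000, "big").otherwise("small"),
)
```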

Read more

Real talk. Polars is all the rage. People love Spark. People use Spark for smallish data, data that’s too big for Pandas but not really “big.” Spark runs on a local machine. Polars runs on a local machine. So what do I choose, Spark or Polars? Does it matter?

I’ve written about Polars at different points, here and here, when discussing wider topics. I mean honestly, I think Polars is the best tool to come out in the last 5 years of Data Engineering. But I find it unwaveringly boring. Which is why it’s so popular.

It’s boring for anyone who has used Pandas, Spark, or other Dataframe tools a lot. Sure, it can be a cool breeze in the face of some poor sap who’s been chained down to Pandas by some boss hanging around from a bygone era. You know what I’m talking about.

But honestly, overall, if you’re just an average engineer piddling around with datasets on your machine, what should you choose? Spark or Polars? Let’s talk some real talk.
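
For the sake of argument, here’s the same aggregation in both, against a hypothetical local Parquet file; a sketch of the ergonomics, not a performance claim.

```python
import polars as pl
from pyspark.sql import SparkSession, functions as F

# Polars: single process, no cluster, no JVM.
sales_pl = pl.read_parquet("sales.parquet")
by_region_pl = sales_pl.group_by("region").agg(pl.col("amount").sum())

# Spark in local mode: same result, with the JVM and all the cluster
# machinery spun up on your laptop.
spark = SparkSession.builder.master("local[*]").getOrCreate()
sales_sp = spark.read.parquet("sales.parquet")
by_region_sp = sales_sp.groupBy("region").agg(F.sum("amount").alias("amount"))
```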

Read more

Save money, save money!! Hear, hear! Someone on LinkedIn recently brought up the point that companies could save gobs of money by swapping out AWS Python Lambdas for Rust ones. While it raised the ire of many a Python Data Engineer, I thought it sounded like a great idea. At least it’s an excuse to play with Rust, and I will take all of those I can get. It does seem like an easy and obvious step to take in this age of cost-cutting that has come down on us all like that thick blanket of fog on a cool spring morning.

I can personally attest to the fact that I’ve written a number of Python AWS Lambdas that do a non-trivial amount of data processing, currently running in Production and triggered many times a day. Today, I’m going to build both a Python and a Rust Lambda on my personal AWS account doing pretty much the same exact work. Let’s see what the difference in performance actually is, and whether it’s possible to find some cost savings.
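
For reference, the Python side of that comparison looks roughly like this; the bucket, key, and event shape are all hypothetical, just to show the kind of small read-filter-write Lambda I’m talking about.

```python
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Read a file from S3, filter the records, write the result back."""
    bucket = event["bucket"]  # assumed event shape, for illustration
    key = event["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    records = [json.loads(line) for line in body.splitlines()]
    kept = [r for r in records if r.get("amount", 0) > 0]

    s3.put_object(
        Bucket=bucket,
        Key=f"processed/{key}",
        Body="\n".join(json.dumps(r) for r in kept).encode("utf-8"),
    )
    return {"records_in": len(records), "records_out": len(kept)}
```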

Read more

Hmm … data types. We all know they are important, but we don’t take them very seriously. I mean, we know the difference between booleans, strings, and integers; those are easy to get right. But we all get sloppy. Sometimes we go the string and varchar route because we don’t spend enough time on the data model to care.

Can a string versus an int or bigint in Delta Lake with Spark have a big impact on performance? On data size? Does it matter? Let’s find out.
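
The experiment is easy enough to set up yourself. Here’s a minimal sketch, with hypothetical local paths and assuming the delta-spark package is configured, that writes the same ids out twice, once as BIGINT and once cast to STRING, so you can compare table sizes and query times.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Ten million ids, stored as BIGINT.
orders = spark.range(10_000_000).withColumnRenamed("id", "order_id")

orders.write.format("delta").mode("overwrite").save("/tmp/delta/orders_bigint")

# The same ids, cast to STRING before writing.
(
    orders
    .withColumn("order_id", F.col("order_id").cast("string"))
    .write.format("delta")
    .mode("overwrite")
    .save("/tmp/delta/orders_string")
)
```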

Read more