Data Engineering Archives - Page 9 of 23 - Confessions of a Data Guy

Polars – Laziness and SQL Context.

Polars is one of those tools that you just want … no … NEED a reason to use. It’s gotten so bad that I’ve started to use Polars in my Rust code on the side. I mean, you have a problem if you could use Polars Python and you find yourself using Polars Rust. Glutton for punishment, I guess.

I also recently took personal offense when someone at a birthday party told me that everyone uses Pandas, and no one uses Polars in the real world. Dang. That hurt.

The reality is that I know it takes a long while for even the best technologies to be adopted. Things don’t just change overnight. But there are two hidden gems of Polars that will hasten the day when Polars replaces Pandas for good. Let’s talk about them.

May 7, 2023

Big Data, Data, Data Engineering, Data Warehousing

Real Talk about Running Databricks + Delta Lake at Scale.

Anyone who’s been working in Data Land for any time at all knows that the reality of life very rarely matches the glut of shiny snake oil we get sold on a daily basis. That’s just part of life. Every new tool, every single thingy-ma-bob we think is going to solve all our problems and send us happily into the state of nirvana inside our eternal data pipelines, is a lesson in disappointment.

I get it, there are a lot of nice tools out there. I use some of them every day. But, a healthy dose of reality is good for us all. Don’t lie to yourself. There is no such thing as the perfect tool. There are good tools, bad tools, and tools in between. The Truth is that all tools get pushed to their limits at some point.

We work on small teams, we don’t have all the time in the world, and we have to deliver our data at some point, perfect or not. We cut corners, hopefully the right ones. That’s part of being wise and putting years of data experience to work. Today I’m going to talk about my experience of running Databricks + Delta Lake at scale. What happens when you use Databricks to ingest and deal with tens of millions of records a day, billions of records a month?

April 26, 2023

Data, Data Engineering, Python, SQL

DuckDB vs Polars for Data Engineering.

I was wondering the other day … since Polars now has a SQL context and is getting more popular by the day, do I need DuckDB anymore? These two tools are hot. Very hot. I haven’t seen this since Databricks and Snowflake first came out and started throwing mud at each other.

You might think it doesn’t matter. Six of one, half a dozen of the other, whatever. But I think about these things. Simplicity is underrated these days. If you have two tools but could do it with one, should you use two? Probably depends on the Engineering culture you’re working in.

I mean just because you can doesn’t mean you should. Some data engineering repo with 50 different Python pip packages installed, constantly breaking and upgrading for no reason. CI/CD build failing, conflicts. Frustration. Why? Just because someone wants to do this one thing and decided they needed yet another package to do it.

April 16, 2023

Big Data, Data, Data Engineering, Python

The Dog Days of PySpark

PySpark. One of those things to hate and love, well … kinda hard not to love. PySpark is the abstraction that lets a bazillion Data Engineers forget about that blight Scala and cuddle their wonderfully soft and ever-kind Python code, while choking down gobs of data like some Harkonnen glutton.

But that comes with a price. The price of our own laziness, of taking the easy path, and of that idea that all that glitters is gold. One of the main problems is the dreadful mistake of mixing native Python in with your PySpark and expecting things to go fine at scale. Which they most assuredly will not.

April 15, 2023

Data, Data Engineering

Polars vs Spark. Real Talk.

Real talk. Polars is all the rage. People love Spark. People use Spark for data that is small, but still too big for Pandas. Spark runs on a local machine. Polars runs on a local machine. What do I choose, Spark or Polars? Does it matter?

I’ve written about Polars at different points, here, and here when discussing wider topics. I mean honestly, I think Polars is the best tool to come out in the last 5 years of Data Engineering. But I find it unwaveringly boring. Which is why it’s so popular.

It’s boring for anyone who has used Pandas, Spark, or other Dataframe tools a lot. Sure, it can be a cool breeze in the face of some poor sap who’s been chained down to Pandas by some boss hanging around from a bygone era. You know what I’m talking about.

But honestly, overall, if you’re just an average engineer piddling around with datasets on your machine, what should you choose? Spark or Polars? Let’s talk some real talk.

March 28, 2023

Data, Data Engineering

Introduction to Linked Lists.

March 26, 2023

AI, Data, Data Engineering

Future Proof Yourself Against AI.

March 23, 2023

Data, Data Engineering

Contributing to Open-Source.

March 7, 2023

Big Data, Data, Data Engineering

What is a Data Mesh?

March 2, 2023

Data, Data Engineering, Python, Rust

AWS Lambdas – Python vs Rust. Performance and Cost Savings.

Save money, save money!! Hear, hear! Someone on LinkedIn recently brought up the point that companies could save gobs of money by swapping out AWS Python lambdas for Rust ones. While it raised the ire of many a Python Data Engineer, I thought it sounded like a great idea. At least it’s an excuse to play with Rust, and I will take all those I can get. It does seem like an easy and obvious step to take in this age of cost-cutting that has come down on us all like that thick blanket of fog on a cool spring morning.

I can personally attest to the fact that I’ve written a number of Python AWS lambdas that are doing a non-trivial amount of data processing, currently running in Production and being triggered many times a day. Today, I’m going to reproduce both a Python and Rust lambda running on my personal AWS account doing pretty much the same exact work. Let’s see what the difference actually is in performance and see if it’s possible to find some cost savings.
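For a rough feel of where the savings would come from, here is a back-of-the-napkin sketch in Python. It assumes the usual Lambda billing shape: you pay for duration times configured memory (GB-seconds), plus a small per-request fee. The function name and the default prices are my own assumptions for illustration (roughly the published x86 on-demand rates at the time of writing), so check them against current AWS pricing, and the durations in the usage lines are made-up placeholders, not measurements.

```python
def lambda_monthly_cost(
    invocations: int,
    avg_duration_ms: float,
    memory_mb: int,
    price_per_gb_second: float = 0.0000166667,  # assumed rate, verify against AWS pricing
    price_per_million_requests: float = 0.20,   # assumed rate, verify against AWS pricing
) -> float:
    """Estimate a monthly Lambda bill in USD, ignoring the free tier."""
    # Compute charge: seconds of runtime scaled by configured memory in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    # Request charge: flat fee per million invocations.
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return compute_cost + request_cost


# Hypothetical numbers only: same workload, same memory, different runtimes.
python_cost = lambda_monthly_cost(1_000_000, avg_duration_ms=1200, memory_mb=512)
rust_cost = lambda_monthly_cost(1_000_000, avg_duration_ms=300, memory_mb=512)
print(f"python: ${python_cost:.2f}  rust: ${rust_cost:.2f}")
```

The takeaway from the formula is that a faster runtime only shrinks the compute part of the bill. The per-request fee stays the same, and a lambda that can also run with less memory saves twice.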

February 26, 2023
