Anyone who’s been working in Data Land for any time at all, knows that the reality of life very rarely matches the glut of shiny snake oil we get sold on a daily basis. That’s just part of life. Every new tool, every single thingy-ma-bob we think is going to solve all our problems and send us happily into the state of nirvana inside our eternal data pipelines, is a lesson in disappointment.

I get it, there are a lot of nice tools out there. I use some of them every day. But, a healthy dose of reality is good for us all. Don’t lie to yourself. There is no such thing as the perfect tool. There are good tools, bad tools, and tools in between. The Truth is that all tools get pushed to their limits at some point.

We work on small teams, we don’t have all the time in the world, and we have to deliver our data at some point, perfect or not. We cut corners, hopefully, the right ones. That’s part of being wise and putting years of data experience to work. Today I’m going to talk about my experience of running Databricks + Delta Lake at scale. What happens when you use Databricks to ingest and deal with 10’s of millions of records a day, billions+ records a month?

Read more

I was wondering the other day … since Polars now has a SQL context and is getting more popular by the day, do I need DuckDB anymore? These two tools are hot. Very hot. I haven’t seen this since Databricks and Snowflake first came out and started throwing mud at each other.

You might think it doesn’t matter. Two of one, half-dozen of another, whatever. But I think about these things. Simplicity is underrated these days. If you have two tools but could do it with one, should you use two? Probably depends on the Engineering culture you’re working in.

I mean just because you can doesn’t mean you should. Some data engineering repo with 50 different Python pip packages installed, constantly breaking and upgrading for no reason. CI/CD build failing, conflicts. Frustration. Why? Just because someone wants to do this one thing and decided they needed yet another package to do it.

Read more

 

PySpark. One of those things to hate and love, well … kinda hard not to love. PySpark is the abstraction that lets a bazillion Data Engineers forget about that blight Scala and cuddle their wonderfully soft and ever-kind Python code, while choking down gobs of data like some Harkonnen glutton.

But, that comes with a price. The price of our own laziness and that idea that all that glitters is gold, to take the easy path. One of the main problems is the dreadful mistake of mixing native Python in with your PySpark and expecting things to go fine at scale. Which it most assuredly will not.

Read more