In my never-ending quest to plumb the most boring depths of every single data tool on the market, I found myself annoyed recently while using DuckDB for a benchmark that read Parquet files from S3. What was neither clear nor easy was figuring out how DuckDB would LIKE to read default AWS credentials.
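For what it's worth, one way recent DuckDB versions can pick up the default AWS credential chain (environment variables, `~/.aws/credentials`, IAM role) is through the secrets manager. A hedged sketch — the secret name and bucket path here are hypothetical:

```sql
-- Sketch, assuming a recent DuckDB with the httpfs and aws extensions:
-- a secret with the CREDENTIAL_CHAIN provider tells DuckDB to resolve
-- credentials the same way the AWS SDK would.
INSTALL httpfs;
LOAD httpfs;
INSTALL aws;
LOAD aws;

CREATE SECRET my_aws (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

-- Hypothetical bucket/path, shown only to illustrate the read.
SELECT count(*) FROM read_parquet('s3://my-bucket/some/path/*.parquet');
```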

Read more

I am a glutton for punishment, a harbinger of tidings, a storm crow, a prophet of the data land, my sole purpose is to plumb the depths of the tools we use every day in Data Engineering. I find the good, the bad, the ugly, and splay them out before you, string ’em up and quarter them.

Today, for the third time, we put that ole’ Duck to the test. I want to see if DuckDB has fixed its OOM (Out Of Memory) errors on commodity hardware … that age-old problem of “larger-than-memory data sets.”
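As background for the test, DuckDB does expose a couple of knobs aimed at exactly this problem. A minimal sketch, assuming a reasonably current DuckDB — the limit and spill path are made-up example values:

```sql
-- Cap how much RAM DuckDB may use, and give it a spill directory
-- so operators can offload to disk instead of dying with an OOM.
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';
```

Whether those settings actually keep it alive on commodity hardware is, of course, the whole point of the benchmark.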

Read more

There are some things you don’t need until you need them. I ran into that situation recently when I had to process some CSV / flat files on short notice. At first it appeared easy, but then I realized, as usual, there was a little monkey wrench thrown into the middle of it.

It is nothing earth-shattering, it’s just something that comes up so rarely that I forget there are ways to deal with these inconveniences without jumping through unnecessary hoops.

Read more

Is there anything worse than the PR (Pull Request) process at most companies? Probably not. It’s the dreaded 600-pound gorilla in the room that no one wants to talk about. Everyone hates it, and everyone has to do it. But it doesn’t have to be like that.

There are a few tried-and-true ways to make the perfect PR that takes all your problems away. Check out the video for more.


This is an interesting one indeed; it teases and puzzles the brain to no end. Has the Data Warehouse finally died? Has that unruly upstart, the Lake House, finally taken its place atop the seething mass of data we call home? Can we say that, after all these decades, the Data Warehouse Toolkit and Kimball have finally gone the way of the dinosaurs? Maybe. Probably. I don’t know.

Read more

I’ve been hacking around with tools and programming since Perl was a thing. I’ve run the gamut of Data Platforms, from large organizations to tiny startups and everything in between. I’ve worked on Data Platforms that dropped ungodly amounts of money on SAP products, and at places where we built our own massive data processing platforms on Kubernetes.

Each to their own I guess.

Read more

I recently wrote on my Substack (Data Engineering Central) about how I used the new OpenAI o1 model to do some basic Data Engineering tasks surrounding PostgreSQL. It did OK. I’ve also been using Copilot and ChatGPT for over a year now to assist with the daily code I have to write for one reason or another.

Read more

It is a Brave New World out there these days. The new tools and features come out faster than your mom on Sunday morning getting you ready for church. The same goes for the context and advice being produced on a myriad of platforms, the ole’ Like and Subscribe, and all that bit. It does make you wonder after a while, what you can trust, who has your best interest in mind, and who is selling you a bottle of snake oil, doesn’t it?

Today we talk about Data Modeling. Specifically, Data Modeling in the new world we all live in, christened The Lakehouse by our benevolent Vendor Overlords.

Read more

Did you know there are only 3 types of Data Engineers? It’s true. I hope you are the right one.

Over the many years I’ve been pounding my keyboard … Perl, PHP, Python, C#, Rust … whatever … I, like most programmers, built up a certain disdain for what are called Low Code / No Code solutions. In my rush to worship at the feet of the code we create, I failed, in the beginning, to recognize some important axioms …

Read more