Data Warehousing Archives - Confessions of a Data Guy

Big Data, Data Engineering, Data Warehousing

dbt on Databricks

Running dbt on Databricks has never been easier. The integration between dbt Core and Databricks could not be simpler to set up and run. Wondering how to approach running dbt models on Databricks with SparkSQL? Watch the tutorial below.

March 28, 2025

Big Data, Data, Data Engineering, Data Warehousing, Python

How I (Barely) Survived Setting Up Polaris as an Iceberg REST Catalog

There are things in life that are satisfying—like a clean DAG run, a freshly brewed cup of coffee, or finally deleting 400 lines of YAML. Then there are things that make you question your life choices. Enter: setting up Apache Polaris (incubating) as an Apache Iceberg REST catalog.

Let’s get one thing out of the way—I didn’t want to do this.

March 26, 2025

Big Data, Data, Data Engineering, Data Warehousing

Apache XTable. Delta vs Iceberg vs Hudi.

The blog post reviews an Apache incubating project called Apache XTable, which aims to provide cross-format interoperability among Delta Lake, Apache Hudi, and Apache Iceberg. Below is a concise breakdown from some time I spent playing around with this new tool, along with some technical observations:

March 4, 2025

Big Data, Data, Data Engineering, Data Warehousing, DuckDB, Python

AWS Lambda + DuckDB + Polars + Daft + Rust

When it comes to building modern Lake House architecture, we often get stuck in the past, doing the same old things time after time. We are human; we are lemmings; it’s just the trap we fall into. Usually, that pit we fall into is called Spark. Now, don’t get me wrong; I love Spark. We couldn’t have what we have today in terms of Data Platforms if it wasn’t for Apache Spark.

January 30, 2025

Big Data, Data, Data Engineering, Data Warehousing

The Death of the Data Warehouse, replaced by the Lake House. Or Has It?

This is an interesting one indeed; it’s one that teases and puzzles the brain to no end. Has the Data Warehouse finally died? Has that unruly upstart, the Lake House, finally taken its place atop the seething mass of data we call home? Can we say that after all these decades the Data Warehouse Toolkit and Kimball have finally gone the way of the dinosaurs? Maybe. Probably. I don’t know.

October 7, 2024

Data, Data Engineering, Data Warehousing

Data Modeling in the Brave New Lakehouse World

It is a Brave New World out there these days. The new tools and features come out faster than your mom on Sunday morning getting you ready for church. The same goes for the context and advice being produced on a myriad of platforms, the ole’ Like and Subscribe, and all that bit. It does make you wonder after a while, what you can trust, who has your best interest in mind, and who is selling you a bottle of snake oil, doesn’t it?

Today we talk about Data Modeling. Specifically, Data Modeling in the new world we all live in, christened The Lakehouse by our benevolent Vendor Overlords.

September 19, 2024

Data Warehousing, Ramblings

Databricks Buys Tabular – 1 Billion Dollar Deal. Iceberg vs Delta Lake?

The battle for the Data Warehouse, Data Lake, Lake House, or whatever you want to call it, in the age of AI just got more interesting. In an unsurprising move, Databricks has announced plans to buy Tabular for 1 billion dollars, beating out Snowflake, which was reportedly trying to do the same thing.

June 4, 2024

Data, Data Engineering, Data Warehousing, SQL

The Case of the Mysterious Recursive CTE

I still remember that day. A day that shall live on in infamy in my mind. Well over a decade ago, in the days when SQL Server roamed the land devouring souls on the Altar of Stored Procedures. There was only one tool available at the time. SQL. That’s it. There was one problem that had to be solved.

The answer? A recursive CTE.

At the same time … both a demon of the dark and a shining angel from the heavens. Just depends on your view.

August 24, 2023

Big Data, Data, Data Engineering, Data Warehousing

Real Talk about Running Databricks + Delta Lake at Scale.


Anyone who’s been working in Data Land for any time at all knows that the reality of life very rarely matches the glut of shiny snake oil we get sold on a daily basis. That’s just part of life. Every new tool, every single thingy-ma-bob we think is going to solve all our problems and send us happily into the state of nirvana inside our eternal data pipelines, is a lesson in disappointment.

I get it, there are a lot of nice tools out there. I use some of them every day. But, a healthy dose of reality is good for us all. Don’t lie to yourself. There is no such thing as the perfect tool. There are good tools, bad tools, and tools in between. The Truth is that all tools get pushed to their limits at some point.

We work on small teams, we don’t have all the time in the world, and we have to deliver our data at some point, perfect or not. We cut corners, hopefully the right ones. That’s part of being wise and putting years of data experience to work. Today I’m going to talk about my experience of running Databricks + Delta Lake at scale. What happens when you use Databricks to ingest and deal with tens of millions of records a day, billions of records a month?

April 26, 2023

Big Data, Data, Data Engineering, Data Warehousing

Data Types in Delta Lake + Spark. Join and Storage Performance.


Hmm … data types. We all know they are important, but we don’t take them very seriously. I mean, we know the difference between booleans, strings, and integers; those are easy to get right. But we all get sloppy; sometimes we go the string and varchar route because we don’t spend enough time on the data model to care.

Can a string versus an int or bigint in Delta Lake with Spark have a big impact on performance? On data size? Does it matter? Let’s find out.

February 11, 2023

