Big Data Archives - Confessions of a Data Guy

Big Data, Data Engineering, Data Warehousing

dbt on Databricks

Running dbt on Databricks has never been easier. The integration between dbt Core and Databricks could not be simpler to set up and run. Wondering how to approach running dbt models on Databricks with Spark SQL? Watch the tutorial below.

March 28, 2025

Big Data, Data, Data Engineering, Data Warehousing, Python

How I (Barely) Survived Setting Up Polaris as an Iceberg REST Catalog

There are things in life that are satisfying—like a clean DAG run, a freshly brewed cup of coffee, or finally deleting 400 lines of YAML. Then there are things that make you question your life choices. Enter: setting up Apache Polaris (incubating) as an Apache Iceberg REST catalog.

Let’s get one thing out of the way—I didn’t want to do this.

March 26, 2025

Big Data, Data, Data Engineering, Data Warehousing

Apache XTable. Delta vs Iceberg vs Hudi.

The blog post reviews an Apache incubating project called Apache XTable, which aims to provide cross-format interoperability among Delta Lake, Apache Hudi, and Apache Iceberg. Below is a concise breakdown from some time I spent playing around with this new tool, along with some technical observations:

March 4, 2025

Big Data, Data, Data Engineering

What is a Healthy Lake House?

Maybe I’m the only one who thinks about it, not sure. The Lake House has become the new Data Warehouse, yet when I ask the question “What makes a healthy Lake House?” no one is sure what the answer is, or you get different answers.

It seems like a pretty important question considering that Lake Houses have taken the data landscape by storm and now store the vast majority of our data. With all the vendors pumping out Lake House formats and platforms (think Delta Lake and Apache Iceberg), the main focus seems to be adding features and addressing internal data quality, aka the quality of the data stored in the Lake House itself.

February 25, 2025

Big Data, Data, Data Engineering, Data Warehousing, DuckDB, Python

AWS Lambda + DuckDB + Polars + Daft + Rust

When it comes to building modern Lake House architecture, we often get stuck in the past, doing the same old things time after time. We are human; we are lemmings; it’s just the trap we fall into. Usually, that pit we fall into is called Spark. Now, don’t get me wrong; I love Spark. We couldn’t have what we have today in terms of Data Platforms if it wasn’t for Apache Spark.

January 30, 2025

Big Data, Data, Data Engineering

What is a Data Platform?

You know, for all the hordes of content, books, and videos produced in the “Data Space” over the last few years, famous or otherwise, I find there are volumes of information on the pieces and parts of working in Data. It could be Data Quality, Data Modeling, Data Pipelines, Data Storage, Compute, and the list goes on. I found this to be a problem as I was growing in my “Data” career over the decades.

January 8, 2025

Big Data, Data, Data Engineering

Data Contracts were a LIE!

Today we talk about what is really going on with Data Contracts. They came in like a rocket a few years ago, but then died on the vine. What’s the deal?

December 13, 2024

Big Data, Data, Data Engineering

AWS S3 Tables. Technical Introduction.

Well, everyone is abuzz with the recently announced S3 Tables that came out of AWS re:Invent this year. I’m going to call fool’s gold on this one right out of the gate. I tried them out, in real life that is, not just some marketing buzz, and they will leave most people, not all, but most, disappointed.

Surprise, surprise.

I wrote a more in-depth article here about the background and infighting between Databricks, Snowflake, and AWS in the Lake House Storage Format wars. If you have time, read that, but today, here, I just want to show you technically how to use S3 Tables in code.

Call it a technical introduction to S3 Tables.

December 7, 2024

Big Data, Data, Data Engineering, Python, SQL

DuckDB … reading from s3 … with AWS Credentials and more.

In my never-ending quest to plumb the most boring depths of every single data tool on the market, I found myself annoyed when recently using DuckDB for a benchmark that was reading Parquet files from S3. What was not clear, or easy, was figuring out how DuckDB would LIKE to read default AWS credentials.

November 18, 2024

Big Data, Data, Data Engineering, Data Warehousing

The Death of the Data Warehouse, replaced by the Lake House. Or Has It?

This is an interesting one indeed; it’s one that teases and puzzles the brain to no end. Has the Data Warehouse finally died? Has that unruly upstart, the Lake House, finally taken its place atop the seething mass of data we call home? Can we say that after all these decades the Data Warehouse Toolkit and Kimball have finally gone the way of the dinosaurs? Maybe. Probably. I don’t know.

October 7, 2024

