Home - Confessions of a Data Guy

Finally, a Simple, Cloud-Friendly Apache Iceberg Catalog That Just Works

Let’s be honest: working with Apache Iceberg stops being fun the moment you step off your local laptop and into anything that resembles production. The catalog system—mandatory and rigid—has long been the Achilles’ heel of an otherwise promising open data format. For a long time, you had two options: over-engineered corporate-grade solutions that require infrastructure […]

May 23, 2025

Data, Data Engineering, Python

Convert CSV to Excel with DuckDB, Polars, etc.

Every so often, I have to convert some .txt or .csv file over to Excel format … just because that’s how the business wants to consume or share the data. It is what it is. This means I am often on the lookout for some easy-to-use, simple one-liners that I can use to […]

April 24, 2025

Uncategorized

Cloudflare R2 Storage with Apache Iceberg

Rethinking Object Storage: A First Look at Cloudflare R2 and Its Built-In Apache Iceberg Catalog

Sometimes, we follow tradition because, well, it works, until something new comes along and makes us question the status quo. For many of us, Amazon S3 is that well-trodden path: the backbone of our data platforms and pipelines, used countless times each day. If […]

April 22, 2025

Big Data, Data Engineering, Data Warehousing

dbt on Databricks

Running dbt on Databricks has never been easier. The integration between dbt Core and Databricks could not be simpler to set up and run. Wondering how to approach running dbt models on Databricks with SparkSQL? Watch the tutorial below.

March 28, 2025

Big Data, Data, Data Engineering, Data Warehousing, Python

How I (Barely) Survived Setting Up Polaris as an Iceberg REST Catalog

There are things in life that are satisfying—like a clean DAG run, a freshly brewed cup of coffee, or finally deleting 400 lines of YAML. Then there are things that make you question your life choices. Enter: setting up Apache Polaris (incubating) as an Apache Iceberg REST catalog. Let’s get one thing out of the […]

March 26, 2025

Data, Data Engineering, Python

Reading Excel (.xlsx) Files with Polars

I make it my duty in life to never have to open an Excel file (xlsx); I feel like if I do, then I made a critical error in my career trajectory. But I recently had no choice but to open an Excel file on a Mac (or try) to look at some sample data from […]

March 18, 2025

Data, Data Engineering

dbt on Databricks.

Context and Motivation

dbt (data build tool): A popular open-source framework that organizes SQL transformations in a modular, version-controlled, and testable way.

Databricks: A platform that unifies data engineering and data science pipelines, typically with Spark (PySpark, Scala) or SparkSQL.

The post explores whether a Databricks environment, often used for Lakehouse architectures, benefits from dbt, especially if […]

March 4, 2025

Big Data, Data, Data Engineering, Data Warehousing

Apache XTable. Delta vs Iceberg vs Hudi.

The blog post reviews an Apache incubating project called Apache XTable, which aims to provide cross-format interoperability among Delta Lake, Apache Hudi, and Apache Iceberg. Below is a concise breakdown from some time I spent playing around with this new tool, along with some technical observations:

March 4, 2025

Big Data, Data, Data Engineering

What is a Healthy Lake House?

Maybe I’m the only one who thinks about it, not sure. The Lake House has become the new Data Warehouse, yet when I ask the question “What makes a healthy Lake House?” no one is sure what the answer is, or you get different answers. It seems like a pretty important question considering that Lake […]

February 25, 2025

Big Data, Data, Data Engineering, Data Warehousing, DuckDB, Python

AWS Lambda + DuckDB + Polars + Daft + Rust

When it comes to building modern Lake House architecture, we often get stuck in the past, doing the same old things time after time. We are human; we are lemmings; it’s just the trap we fall into. Usually, that pit we fall into is called Spark. Now, don’t get me wrong; I love Spark. We […]

January 30, 2025
