January 2025 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing, DuckDB, Python

AWS Lambda + DuckDB + Polars + Daft + Rust

When it comes to building modern Lake House architecture, we often get stuck in the past, doing the same old things time after time. We are human; we are lemmings; it’s just the trap we fall into. Usually, that pit we fall into is called Spark. Now, don’t get me wrong; I love Spark. We couldn’t have what we have today in terms of Data Platforms if it wasn’t for Apache Spark.

January 30, 2025

Data, Data Engineering, Data Quality

PySpark Data Quality on Databricks with DQX.

A Deep Dive into Databricks Labs’ DQX: The Data Quality Game Changer for PySpark DataFrames

Recently, a LinkedIn announcement caught my eye—and honestly, it had me on the edge of my seat. Databricks Labs has unveiled DQX, a Python-based Data Quality framework explicitly designed for PySpark DataFrames.

Finally, a Dedicated Data Quality Tool for PySpark

Data Quality has always been a cyclical topic in the data community. Despite its importance, it’s been hampered by a lack of simple, open-source tools. Yes, we have options like Soda Core and Great Expectations, but they can be cumbersome to integrate. Enter DQX.

January 17, 2025

Uncategorized

Testing and Development for Databricks Environment and Code.

Every once in a great while, the question comes up: “How do I test my Databricks codebase?” It’s a fair question, and if you’re new to testing your code, it can seem a little overwhelming on the surface. However, I assure you the opposite is the case.

January 11, 2025

Big Data, Data, Data Engineering

What is a Data Platform?

You know, for all the hordes of content, books, and videos produced in the “Data Space” over the last few years, famous or otherwise, it seems there are volumes of information on the pieces and parts of working in Data. It could be Data Quality, Data Modeling, Data Pipelines, Data Storage, Compute, and the list goes on. I found this to be a problem as I was growing in my “Data” career over the decades.

January 8, 2025

Uncategorized

Building a Fast, Light, and CHEAP Lake House with DuckDB, Delta Lake, and AWS Lambda

Building fun things is a real part of Data Engineering. Using your creative side when building a Lake House is possible, and using tools that are outside the normal box can sometimes be preferable. Check out this video where I dive into how I build just such a Lake House using Modern Data Stack tools like AWS Lambda (for cheap and fast compute), DuckDB (for data processing), and Delta Lake (for storage).

January 8, 2025

Uncategorized

Using DuckDB to read JSON files in S3

I’ve been playing around more and more lately with DuckDB. It’s a popular SQL-based tool that is lightweight and probably one of the easiest tools to install and use. I mean, who doesn’t know how to pip install something and write SQL? Those are probably the very first things you learn when cutting your teeth on programming.

January 7, 2025

Data, Data Engineering

Simplicity in the Modern Data Stack

We have all come to live in the Modern Data Stack, and whether we like it or not, our lives are no longer as simple as they were in the days of SQL Server and SSIS. Things have changed A LOT. There are good and bad sides to that coin. The Modern Data Stack has brought us amazing innovations and tools and made things possible that were simply unheard of before.

January 1, 2025

