When it comes to building a modern Lake House architecture, we often get stuck in the past, doing the same old things time after time. We are human; we are lemmings; it’s just the trap we fall into. Usually, that pit is called Spark. Now, don’t get me wrong; I love Spark. We couldn’t have what we have today in terms of Data Platforms if it weren’t for Apache Spark.

A Deep Dive into Databricks Labs’ DQX: The Data Quality Game Changer for PySpark DataFrames

Recently, a LinkedIn announcement caught my eye, and honestly, it had me on the edge of my seat. Databricks Labs has unveiled DQX, a Python-based Data Quality framework explicitly designed for PySpark DataFrames.

Finally, a Dedicated Data Quality Tool for PySpark

Data Quality has always been a cyclical topic in the data community. Despite its importance, it’s been hampered by a lack of simple, open-source tools. Yes, we have options like Soda Core and Great Expectations, but they can be cumbersome to integrate. Enter DQX.

Every once in a great while, the question comes up: “How do I test my Databricks codebase?” It’s a fair question, and if you’re new to testing your code, it can seem a little overwhelming on the surface. However, I assure you the opposite is the case.

You know, for all the hordes of content, books, and videos produced in the “Data Space” over the last few years, famous or otherwise, it seems there are volumes of information on the individual pieces and parts of working in Data. It could be Data Quality, Data Modeling, Data Pipelines, Data Storage, Compute, and the list goes on. I found this to be a problem as I was growing in my “Data” career over the decades.

Building fun things is a real part of Data Engineering. Using your creative side when building a Lake House is possible, and tools that are outside the normal box can sometimes be preferable. Check out this video, where I dive into how I build just such a Lake House using Modern Data Stack tools like AWS Lambda (for cheap and fast compute), DuckDB (for data processing), and Delta Lake (for storage).

I’ve been playing around more and more lately with DuckDB. It’s a popular, lightweight SQL-based tool, probably one of the easiest to install and use. I mean, who doesn’t know how to pip install something and write SQL? It’s probably the very first thing you learn when you’re cutting your teeth on programming and still wet behind the ears.

We have all come to live in the Modern Data Stack, and whether we like it or not, our lives are no longer as simple as they were in the days of SQL Server and SSIS. Things have changed A LOT. There are good and bad sides to that coin. The Modern Data Stack has brought us amazing innovations and tools and made things possible that were simply unheard of before.
