Home - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing, Python

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

The testing never ends. Tests tests tests, and more tests. When it comes to data engineering and data pipelines it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines. […]

February 1, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Now we are getting to the crux of the matter. I would say Data Modeling is probably one of the most unaddressed, yet important parts of Data Warehousing, Data Lakes, and Lake Houses. It raises the most questions and concerns and is responsible for the rise and fall of many Data Engineers. This is what […]

January 17, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

This is the start of a 5-part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. In Part 2 we are digging into the common Big Data tools and how those technologies have a direct impact on Data Models and what kind of Datastore ends up being designed. Part 1 – What […]

January 15, 2022

Big Data, Data, Data Engineering, Data Warehousing

5 Part Series – Demystifying Data Warehouses / Data Lakes / Lake Houses

Even I get confused these days. Data Warehouse, Data Lake, and Lake Houses … why do we have three, what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lots of new vendors on the streets trying to hawk their tools […]

January 1, 2022

Big Data, Data, Data Engineering

2 Useful PySpark Functions

I’ve come to have a great love for PySpark, it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data, without even blinking an eye. And all this with the ease of Python, it’s almost too good to be true. I have to […]

December 28, 2021

Big Data, Data, Data Engineering, Ramblings

DataFrames vs SparkSQL – Which One Should You Choose?

I’ve been amazed at the growth of Spark over the last few years. I remember 5 years ago, when I first started writing about Spark here and there, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, it seemed the barrier to widely adopted Spark […]

December 27, 2021

Data, Data Engineering, Python, SQL

Performance Testing Postgres Inserts with Python

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I […]

December 17, 2021

Big Data, Data, Data Engineering, Data Warehousing

Hive Metastore in Databricks – What To Know.

Hive is like the zombie apocalypse of the Big Data world, it can’t be killed, it keeps coming back. More specifically, the lesser-known Hive Metastore is the little sneaker that has wormed its way into a lot of Big Data tooling and platforms, in a quasi-behind-the-scenes way. Many people don’t realize it, […]

December 5, 2021

Big Data, Data, Data Engineering, Data Warehousing

Lessons Learned from MERGE operations with Billions of Records on Databricks Spark

Something happens when you start working with tens of billions of records and data sets that are hundreds of TBs in size. Do you know what happens? Things stop working, that’s what. I miss the days when 1-10 TBs were considered large and in charge. The good ole days. I want to talk about lessons […]

December 1, 2021

Big Data, Data, Data Engineering, Data Warehousing, SQL

CTE vs SubQuery

What to choose, what to choose? The age-old problem that has plagued data engineers forever, ok maybe like 10 years: should you use CTEs or Sub-Queries when writing your SQL code? This has become an even more relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. […]

November 18, 2021

