This is part of a 5-part series on demystifying Data Warehouses / Data Lakes / Lake Houses. In Part 2 we are digging into the common Big Data tools and how those technologies directly shape Data Models and the kind of Datastore that ends up being designed. Part 1 – What […]

Even I get confused these days. Data Warehouses, Data Lakes, and Lake Houses … why do we have three, and what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lots of new vendors on the streets trying to hawk their tools […]

I’ve come to have a great love for PySpark; it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data without even blinking an eye. And all this with the ease of Python; it’s almost too good to be true. I have to […]

I’ve been amazed at the growth of Spark over the last few years. I remember 5 years ago, when I first started writing about Spark here and there, it was popular but still not used that widely at smaller companies. AWS Glue was just starting to get popular, and it seemed the barrier to widely adopted Spark […]

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I […]

Hive is like the zombie apocalypse of the Big Data world: it can’t be killed, and it keeps coming back. More specifically, the lesser-known Hive Metastore is the little sneaker that has wormed its way into a lot of Big Data tooling and platforms, in a quasi-behind-the-scenes way. Many people don’t realize it, […]

Something happens when you start working with tens of billions of records and data sets that are hundreds of TBs in size. Do you know what happens? Things stop working, that’s what. I miss the days when 1-10 TBs were considered large and in charge. The good ol’ days. I want to talk about lessons […]

What to choose, what to choose? The age-old problem that has plagued data engineers forever (ok, maybe like 10 years): should you use CTEs or subqueries when writing your SQL code? This has become an even more relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. […]

Seriously, just don’t do it, they are bad for you. Listen to your mother, just say no. The dreaded ORMs (Object-Relational Mapping) that do all the hard SQL work for you. But they come with many unintended consequences that are bad for your health and wellness in the long term. Many unsuspecting […]

I’m not sure what it is, but some prevailing evil in the Data Engineering world has made it not so common for PySpark pipelines to be unit tested. Who knows, it’s probably a combination of things. Data Engineers have been accused of not having good Software Engineering principles. Functional testing is a hot commodity in […]