I’ve come to have a great love for PySpark, it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data, without even blinking an eye. And all this with the ease of Python, it’s almost too good to be true. I have to say though, where things get a little dicey is when you need to do something maybe “out-the-box”, say, strange text manipulations, something that is easy in Python becomes a challenge in PySpark using DataFrame API functionality.

Sure, you could use a udf written in Python for that, but we all know the performance penalty for that. Many times I just try to get creative with a combination of PySpark functions to accomplish the same task others would use a udf for.

I want to talk about two wonderful PySpark functions I find myself using a lot, they come in handy and I rarely see them used, hopefully, they come in handy for you!

Read more

I’ve been amazed at the growth of Spark over the last few years. I remember 5 years when I first started writing about Spark here and there, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, it seemed the barrier to widely adopted Spark was the managing of Spark clusters etc. That has all changed the last few years with EMR, Databricks, and the like.

Back in those days, it was common for most Spark pipelines to be written with the DataFrame API, you didn’t see much SparkSQL around. I’m going to talk about how that has changed, what you should be using, and why.

Read more

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I worked on a lot during the beginning of my data career. Relational databases plus Python was pretty much the name of the game.

One of the struggles I always had was how fast can I load this data into Postgres? psycopg2 was always my Python package of choice for working with Postgres, it’s a wonderful library. Today I want to give a shout-out to my old self by performance testing Python inserts into Postgres. There are about a million ways and sizes and shapes to getting a bunch of records from some CSV file, through Python, and into Postgres.

I also enjoy making people mad … there’s always that. Nothing makes people mad at you like a good ol’ performance test 🙂

Read more

Hive is like the zombie apocalypse of the Big Data world, it can’t be killed, it keeps coming back. More specifically the lesser-known Hive Metastore is the little sneaker that has wormed its way into a lot of Big Data tooling and platforms, in a quasi behind the scenes way. Many people don’t realize it, but Hive Metastore is the beating heart behind many systems, including Databricks. It’s one of those topics that sneaks up on you, ignore it happily at your own peril, till all of a sudden you need to know everything about it.

Specifically, I want to talk about Hive MetaStore as related to Databricks, how it works inside the Databricks platform, and what you need to know. I tripped myself up a lot during my initial forays into Databricks at a Production level. When you wander outside the realm of Notebooks, which you should, strange things start to happen. Databricks seems to assume you already have your own Hive Metastore, maybe like the Glue Data Catalog, or that you want to set up your own somewhere. But what if you don’t?

Read more

Something happens with you starting working with 10’s of billions of records and data sets that are hundreds of TBs in size. Do you know what happens? Things stop working, that’s what. I miss the days where 1-10 TBs were considered large and in charge. the good ole days.

I want to talk about lessons learned from working with MERGE INTO using Databricks Sparks. The suggestions, the marketing material, the internet, and what you actually need to do to gain reasonable performance. It’s easy to say … “here … use this new feature, you will get % 50-speed improvements.” Yeah right. Honestly, new features and fancy tricks always help, but typically it comes down to the fundamentals. The “boring” stuff if you will, that make or break Big Data operations.

Read more

What to choose what to choose? The age-old problem that has plagued data engineers forever, ok maybe like 10 years, should you use CTE’s or Sub-Queries when writing your SQL code. This has become even more of a relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. 15 years ago working on SQL Server I would ask myself the same question.

Are they really that different at all? Is it just a matter of preference? Let’s take a look at a few examples of CTE vs Subquery using SparkSQL as an example and see what we see.

Read more

Seriously, just don’t do it, they are bad for you. Listen to your mother, just say no. The dreaded ORM’s ( Object Relational Mapping ) that do all the hard SQL work for you. But, they come with many unintended consequences that are bad for your health and wellness in the long term. Many unsuspecting victims have been sucked into ORMs with the promise of an easier transition to allow programmers a familiar object-oriented design pattern for manipulating the data in a relational database, say Postgres or MySQL.

Again I tell you, don’t fall for the siren songs, there are tears and sorrow down the long and lonely ORM road.

Read more

I’m not sure what it is, but some prevailing evil in the Data Engineering world has made it not so common for PySpark pipelines to be unit tested. Who knows, it’s probably a combination of things. Data Engineers have been accused of not having good Software Engineering principles. Functional testing is a hot commodity in the Software Engineering world but probably takes a while to trickle its way into mainstream Data Engineering. It can require good Docker skills. Also, generally speaking, the old school Data and ETL Developers that preceded Data Engineers in the bygone days never unit tested …. so neither do their ancestors.

Who knows? All that being said I want to give you 3 tips to help you unit test your PySpark ETL data pipelines.

Read more

Databricks, easily the hotest tool these days for Data Lakes and Data Warehousing, it’s a beast. As with any new technology there are always growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understand certain concepts, and being unware of specific configurations can cost you time and money very easily when running large ETL pipelines on Databricks.

I want to share 7 tips for Databricks newbies, and oldies, that are foundational to good Data Engineering architecture, affecting both performance and cost.

Read more

Data Modeling is a topic that never goes away. Sometimes I do reminisce about the good ol’ days of Kimball-style data models, it was so simple, straightforward, just the same thing for years. Then Big Data happened, Spark happened. Things just changed. There is a lot of new content coming out around Data Lakes and data modeling, but it still seems like a fluid topic, with nothing as concrete as the classic Data Warehouse toolkit.

Oh, what to do what to do. I do believe there are a few key ideas and points to being successful with file-based Data Lake modeling. I think it’s a mistake to fully embrace the classic Kimball-style Data Warehouse approach. It really comes down to Relational Database SQL vs File-Based data models are going to be different, for technical and practical reasons.

Read more