Big Data Archives - Page 7 of 11 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

Is there any problem more classic to the Data Lakes and Data Warehouses than duplicate records? You would think after doing the same ETL for over a decade I could avoid the issue, apparently not. It’s good never to think too highly of one’s self, the duplicates can get us all. Today I want to talk about a wonderful feature of Databricks + Delta Lake MERGE statements that are perfect for quietly and insidiously injecting duplicates into your Data Warehouse or Data Lake. This is a great trick to play on your unsuspecting coworkers.

February 3, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

The testing never ends. Tests tests tests, and more tests. When it comes to data engineering and data pipelines it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines.

In the defense of many data engineers, because of the varied backgrounds people come from, some were never taught or realized the importance of good software design and testing practices. Sure, it always “takes more time” upfront to design data pipelines with code that is functional and unit-testable, and worse, able to be integration tested from end to end. It requires some foresight and thought in both data architecture and pipeline design to enable complete testability.

Integration testing end-to-end in an automated manner is a tough nut to crack. How can you do such a thing on massive pipelines that crunch hundreds of TBs of data? With a little creativity.

February 1, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Now we are getting to the crux of the matter. I would say Data Modeling is probably one of the most unaddressed, yet important parts of Data Warehousing, Data Lakes, and Lake Houses. It raises the most questions and concerns and is responsible for the rise and fall of many Data Engineers.

This is what really drives the difference between the”big three”, Data Modeling.

January 17, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

This is a start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. In Part 2 We are digging into the common Big Data tools and how those technologies have a direct impact on Data Models and what kind of Datastore ends up being designed.

Part 1 – What are Data Warehouses, Data Lakes, and Lake Houses?

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Part 4 – Keys To Sucess – Idemptoency and Partitioning.

Part 5 – Serving Data from your Data Warehouse, Data Lake, or Lake House.

January 15, 2022

Big Data, Data, Data Engineering, Data Warehousing

5 Part Series – Demystifying Data Warehouses / Data Lakes / Lake Houses

Even I get confused these days. Data Warehouse, Data Lake, and Lake Houses … why do we have three, what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lot’s of new vendors on the streets trying to hawk their tools and solutions, each of them pumping out content designed to solve all your data needs.

I’ve seen a lot of content out there by SAAS vendors, and by folks who ascribe to a said vendor, about Data Lakes and Lake Houses, new schema designs and approaches, and it’s hard to know what is just a sales tactic and what is real. I’m going to stir the pot.

This is a start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. Enjoy.

January 1, 2022

Big Data, Data, Data Engineering

2 Useful PySpark Functions

I’ve come to have a great love for PySpark, it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data, without even blinking an eye. And all this with the ease of Python, it’s almost too good to be true. I have to say though, where things get a little dicey is when you need to do something maybe “out-the-box”, say, strange text manipulations, something that is easy in Python becomes a challenge in PySpark using DataFrame API functionality.

Sure, you could use a udf written in Python for that, but we all know the performance penalty for that. Many times I just try to get creative with a combination of PySpark functions to accomplish the same task others would use a udf for.

I want to talk about two wonderful PySpark functions I find myself using a lot, they come in handy and I rarely see them used, hopefully, they come in handy for you!

December 28, 2021

Big Data, Data, Data Engineering, Ramblings

DataFrames vs SparkSQL – Which One Should You Choose?

I’ve been amazed at the growth of Spark over the last few years. I remember 5 years when I first started writing about Spark here and there, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, it seemed the barrier to widely adopted Spark was the managing of Spark clusters etc. That has all changed the last few years with EMR, Databricks, and the like.

Back in those days, it was common for most Spark pipelines to be written with the DataFrame API, you didn’t see much SparkSQL around. I’m going to talk about how that has changed, what you should be using, and why.

December 27, 2021

Big Data, Data, Data Engineering, Data Warehousing

Hive Metastore in Databricks – What To Know.

Hive is like the zombie apocalypse of the Big Data world, it can’t be killed, it keeps coming back. More specifically the lesser-known Hive Metastore is the little sneaker that has wormed its way into a lot of Big Data tooling and platforms, in a quasi behind the scenes way. Many people don’t realize it, but Hive Metastore is the beating heart behind many systems, including Databricks. It’s one of those topics that sneaks up on you, ignore it happily at your own peril, till all of a sudden you need to know everything about it.

Specifically, I want to talk about Hive MetaStore as related to Databricks, how it works inside the Databricks platform, and what you need to know. I tripped myself up a lot during my initial forays into Databricks at a Production level. When you wander outside the realm of Notebooks, which you should, strange things start to happen. Databricks seems to assume you already have your own Hive Metastore, maybe like the Glue Data Catalog, or that you want to set up your own somewhere. But what if you don’t?

December 5, 2021

Big Data, Data, Data Engineering, Data Warehousing

Lessons Learned from MERGE operations with Billions of Records on Databricks Spark

Something happens with you starting working with 10’s of billions of records and data sets that are hundreds of TBs in size. Do you know what happens? Things stop working, that’s what. I miss the days where 1-10 TBs were considered large and in charge. the good ole days.

I want to talk about lessons learned from working with MERGE INTO using Databricks Sparks. The suggestions, the marketing material, the internet, and what you actually need to do to gain reasonable performance. It’s easy to say … “here … use this new feature, you will get % 50-speed improvements.” Yeah right. Honestly, new features and fancy tricks always help, but typically it comes down to the fundamentals. The “boring” stuff if you will, that make or break Big Data operations.

December 1, 2021

Big Data, Data, Data Engineering, Data Warehousing, SQL

CTE vs SubQuery

What to choose what to choose? The age-old problem that has plagued data engineers forever, ok maybe like 10 years, should you use CTE’s or Sub-Queries when writing your SQL code. This has become even more of a relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. 15 years ago working on SQL Server I would ask myself the same question.

Are they really that different at all? Is it just a matter of preference? Let’s take a look at a few examples of CTE vs Subquery using SparkSQL as an example and see what we see.

November 18, 2021

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

5 Part Series – Demystifying Data Warehouses / Data Lakes / Lake Houses

2 Useful PySpark Functions

DataFrames vs SparkSQL – Which One Should You Choose?

Hive Metastore in Databricks – What To Know.

Lessons Learned from MERGE operations with Billions of Records on Databricks Spark

CTE vs SubQuery

Interesting links

Pages

Categories

Archive