Data Engineering Archives - Page 14 of 23 - Confessions of a Data Guy

Data, Data Engineering, Python, Ramblings

boto3 or aws cli for s3 … Python vs Bash and other thoughts.

For any Data Engineer working on aws for any length of time, there is one task that always seems to come up and never go away. Manipulating files in an s3 bucket on aws is something I’ve had to do for years. It’s always something … listing files, moving files, copying files, checking for files, getting the last modified file, checking file sizes, downloading files … it pretty much never ends.

Luckily aws provides a few tools to make these tasks easy: their handy cli for command-line work, and the trusty boto3 Python package. I want to give an introduction to the common commands Data Engineers have to run with both the aws cli and boto3 to perform various common tasks. We will then compare and contrast which tool to use in our pipelines and the pros and cons of each.
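As a small taste of the kind of task in question, here is a minimal sketch of "getting the last modified file" with boto3. It leans on the `list_objects_v2` paginator. The helper name `latest_object` is made up for illustration, and the client is passed in as an argument so the function can be tried against a stub without touching aws.

```python
from typing import Any, Optional


def latest_object(s3_client: Any, bucket: str, prefix: str = "") -> Optional[dict]:
    """Return the most recently modified object under a prefix, or None.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    newest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    return newest
```

With real credentials this would be called as `latest_object(boto3.client("s3"), "my-bucket", "raw/")`, where the bucket and prefix are placeholders. A rough cli counterpart is `aws s3 ls s3://my-bucket/raw/ --recursive | sort | tail -n 1`, which works because each listing line starts with the modified date and time.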

February 28, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 4 – Keys To Success – Idempotency and Partitioning.

As the road winds on we come to Part 4 of our 5 Part Series on Data Warehouses, Lakes, and Lake Houses. Finally, we are getting to some fun topics after all the boring stuff. Today I want to talk about the two keys to success in your Data Lakes … Idempotency and Partitioning. I firmly believe these two concepts are the cornerstones of the new exciting, or not-so-exciting, world of Data Lakes and Lake Houses, without which your data and pipelines go the way of the dodo.
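To make the two ideas concrete, here is a minimal sketch that uses the local filesystem as a stand-in for a Data Lake. Each run owns one `date=` partition and replaces it wholesale, so running the same day twice leaves the same data behind. The directory layout, file name, and the helper names `write_partition` and `count_rows` are assumptions for illustration, not anything taken from the post.

```python
import csv
import shutil
from pathlib import Path


def write_partition(root: Path, run_date: str, rows: list) -> Path:
    """Replace the partition for run_date so re-runs never stack up rows."""
    part_dir = root / f"date={run_date}"
    # Drop whatever a previous run left behind, then rewrite from scratch.
    if part_dir.exists():
        shutil.rmtree(part_dir)
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return part_dir


def count_rows(root: Path) -> int:
    """Count rows across every date partition under root."""
    total = 0
    for path in root.glob("date=*/*.csv"):
        with open(path, newline="") as f:
            total += sum(1 for _ in csv.reader(f))
    return total
```

Calling `write_partition` twice for the same date leaves `count_rows` unchanged, which is the whole point: the partition is the unit of work, and overwriting it is what makes the re-run safe.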

February 9, 2022

Big Data, Data, Data Engineering, Data Warehousing

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

Is there any problem more classic to Data Lakes and Data Warehouses than duplicate records? You would think after doing the same ETL for over a decade I could avoid the issue; apparently not. It’s good never to think too highly of oneself; the duplicates can get us all. Today I want to talk about a wonderful feature of Databricks + Delta Lake MERGE statements that is perfect for quietly and insidiously injecting duplicates into your Data Warehouse or Data Lake. This is a great trick to play on your unsuspecting coworkers.
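The excerpt does not say how the duplicates sneak in. One way it can happen, assumed here purely for illustration, is a source batch that carries the same key more than once, or that picks a different "winner" per key from one run to the next. A minimal sketch in plain Python (not the Delta Lake API) of reducing a batch to exactly one row per key, deterministically, before it goes anywhere near a MERGE; the helper name `dedupe_source` and the column names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List


def _rank(row: dict, order_by: str) -> tuple:
    # Ties on order_by fall back to the row's full contents, so the winner
    # never depends on the order in which rows arrive.
    return (row[order_by], repr(sorted(row.items())))


def dedupe_source(rows: Iterable[dict], key: str, order_by: str) -> List[dict]:
    """Keep one row per key: the one that ranks highest on order_by."""
    best: Dict[object, dict] = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or _rank(row, order_by) > _rank(current, order_by):
            best[row[key]] = row
    # Sort by key so the output order is stable across runs as well.
    return sorted(best.values(), key=lambda r: r[key])
```

For example, `dedupe_source(batch, key="id", order_by="updated_at")` returns the same rows whether the batch arrives in file order or shuffled, which is what "deterministic" needs to mean before a MERGE can be trusted.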

February 3, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

End-to-End Pipeline Integration Testing for Databricks + Delta Lake.

The testing never ends. Tests tests tests, and more tests. When it comes to data engineering and data pipelines it seems good practices are finally catching up after years. In the past, the data engineering community took a lot of heat, and rightly so, for not adopting good software engineering principles, especially in data pipelines.

In the defense of many data engineers, because of the varied backgrounds people come from, some were never taught or realized the importance of good software design and testing practices. Sure, it always “takes more time” upfront to design data pipelines with code that is functional and unit-testable, and worse, able to be integration tested from end to end. It requires some foresight and thought in both data architecture and pipeline design to enable complete testability.

Integration testing end-to-end in an automated manner is a tough nut to crack. How can you do such a thing on massive pipelines that crunch hundreds of TBs of data? With a little creativity.

February 1, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Now we are getting to the crux of the matter. I would say Data Modeling is probably one of the most unaddressed, yet important parts of Data Warehousing, Data Lakes, and Lake Houses. It raises the most questions and concerns and is responsible for the rise and fall of many Data Engineers.

This is what really drives the difference between the “big three”: Data Modeling.

January 17, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

This is part of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. In Part 2 we are digging into the common Big Data tools and how those technologies have a direct impact on Data Models and what kind of Datastore ends up being designed.

Part 1 – What are Data Warehouses, Data Lakes, and Lake Houses?

Part 2 – How Technology Platforms affect your Data Warehouse, Data Lake, and Lake Houses.

Part 3 – Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.

Part 4 – Keys To Success – Idempotency and Partitioning.

Part 5 – Serving Data from your Data Warehouse, Data Lake, or Lake House.

January 15, 2022

Big Data, Data, Data Engineering, Data Warehousing

5 Part Series – Demystifying Data Warehouses / Data Lakes / Lake Houses

Even I get confused these days. Data Warehouse, Data Lake, and Lake Houses … why do we have three, what are the differences? Is it all just marketing huff-a-luff? Technology and life in the data world seem to be changing fast these days. Lots of new vendors on the streets trying to hawk their tools and solutions, each of them pumping out content designed to solve all your data needs.

I’ve seen a lot of content out there by SAAS vendors, and by folks who subscribe to a given vendor, about Data Lakes and Lake Houses, new schema designs and approaches, and it’s hard to know what is just a sales tactic and what is real. I’m going to stir the pot.

This is the start of a 5 part series on Demystifying Data Warehouses / Data Lakes / Lake Houses. Enjoy.

January 1, 2022

Big Data, Data, Data Engineering

2 Useful PySpark Functions

I’ve come to have a great love for PySpark; it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data, without even blinking an eye. And all this with the ease of Python, it’s almost too good to be true. I have to say though, where things get a little dicey is when you need to do something a little “outside the box”, say, strange text manipulations. Something that is easy in Python becomes a challenge in PySpark using DataFrame API functionality.

Sure, you could use a udf written in Python for that, but we all know the performance penalty for that. Many times I just try to get creative with a combination of PySpark functions to accomplish the same task others would use a udf for.

I want to talk about two wonderful PySpark functions I find myself using a lot. I rarely see them used, so hopefully they come in handy for you!

December 28, 2021

Big Data, Data, Data Engineering, Ramblings

DataFrames vs SparkSQL – Which One Should You Choose?

I’ve been amazed at the growth of Spark over the last few years. I remember 5 years ago, when I first started writing about Spark here and there, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, and it seemed the barrier to wider Spark adoption was managing Spark clusters, etc. That has all changed in the last few years with EMR, Databricks, and the like.

Back in those days, it was common for most Spark pipelines to be written with the DataFrame API, you didn’t see much SparkSQL around. I’m going to talk about how that has changed, what you should be using, and why.

December 27, 2021

Data, Data Engineering, Python, SQL

Performance Testing Postgres Inserts with Python

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I worked on a lot during the beginning of my data career. Relational databases plus Python was pretty much the name of the game.

One of the struggles I always had was how fast can I load this data into Postgres? psycopg2 was always my Python package of choice for working with Postgres; it’s a wonderful library. Today I want to give a shout-out to my old self by performance testing Python inserts into Postgres. There are about a million ways, sizes, and shapes of getting a bunch of records from some CSV file, through Python, and into Postgres.

I also enjoy making people mad … there’s always that. Nothing makes people mad at you like a good ol’ performance test 🙂

December 17, 2021
