Data Warehousing Archives - Page 4 of 6 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing

6 Tips for Optimizing Databricks Cluster and Pipeline Performance and Cost

Databricks, easily the hottest tool these days for Data Lakes and Data Warehousing, is a beast. As with any new technology there are always growing pains, learnings, and tips and tricks that might not be obvious to those dipping their toes in the water. Not understanding certain concepts, and being unaware of specific configurations, can cost you time and money very easily when running large ETL pipelines on Databricks.

I want to share 6 tips for Databricks newbies, and oldies, that are foundational to good Data Engineering architecture, affecting both performance and cost.

October 29, 2021

Big Data, Data, Data Engineering, Data Warehousing

Data Modeling – Relational Databases (SQL) vs Data Lake (File Based)

Data Modeling is a topic that never goes away. Sometimes I do reminisce about the good ol’ days of Kimball-style data models. It was all so simple, straightforward, just the same thing for years. Then Big Data happened, Spark happened. Things just changed. There is a lot of new content coming out around Data Lakes and data modeling, but it still seems like a fluid topic, with nothing as concrete as the classic Data Warehouse Toolkit.

Oh, what to do, what to do. I do believe there are a few key ideas and points to being successful with file-based Data Lake modeling. I think it’s a mistake to fully embrace the classic Kimball-style Data Warehouse approach. It really comes down to this: relational database (SQL) and file-based data models are going to be different, for technical and practical reasons.

October 20, 2021

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

Databricks vs AWS EMR – Theory and Real Life.

I saw a recent post on r/dataengineering, a question centered around why Databricks is so popular when tools like EMR have been floating around for so long. It got me thinking. It really isn’t all about the technical side and offerings, although that does play a large role. There are always proponents for every technology, old or new … like with our favorite band or sports team, we fight to the death for what we love and cherish. I want to talk theoretically and technically about Databricks and EMR, and why you should use Databricks. 🙂

August 19, 2021

Big Data, Data, Data Engineering, Data Warehousing, SQL

Data Lake vs Data Warehouse – What’s the Dealio?

Data Lake, Data Warehouse, Lake House, Data Mart, it’s always something isn’t it? Don’t get me started on Data Mesh. Yikes, it’s hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it really all boils down to: what is the real difference? Is it data modeling, architecture, storage? I think there are a few different things that differentiate a Data Lake from a true Data Warehouse, let’s talk.

July 10, 2021

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

Databricks vs Snowflake. The DataLake/Warehouse Battle.

As someone who worked around the classic Data Warehouses back in the day, before S3 took over and SQL Server and Oracle ruled the day … I love sitting on the sidelines watching new … yet old battle-lines being re-drawn. I could probably scroll back 12 years on Stack Overflow and find the same arguments and questions. In one sense Databricks and Snowflake are totally different tools … but are they? Distributed big data processing, applying transforms to data, enabling Data Lake / Data Warehouse / Analytics at scale. There is a lot of bleed over between the two; it really comes down to what path you would like to take to get to the same goal.

June 28, 2021

Big Data, Data, Data Engineering, Data Warehousing, SQL

Intro to Apache Druid … What is this Devilry

Apache Druid, kinda like that second cousin you know about … but don’t really know. When you see them for the first time in 10 years you kinda look at them out of the corner of your eye. That’s how I feel about Apache Druid. I’ve always known it has been there, lurking around in the shadows, but it rarely pokes its head out and I have no idea what it is, why it exists, or how it is used. Time to change that, for better or worse. Let’s take a 10,000-foot survey of Druid.

June 7, 2021

Big Data, Data, Data Engineering, Data Warehousing, Python

Why Data Engineers should use AWS Lambda Functions.

When I used to think of Lambda functions on AWS my eyes would glaze over, I would roll my eyes and say, “I work with big data, what in the world can a silly little AWS Lambda function offer me?” I’ve had to eat my own words, those little suckers come in handy in my day-to-day engineering work. I want to talk about how every data engineer working with AWS can take advantage of Lambdas and add them to their data pipeline tool belt.

June 2, 2021

Big Data, Data, Data Engineering, Data Warehousing

The Elusive Idempotent Data Load/ETL

This is a topic I’ve been musing about lately. The idempotent data load has been a source of much pain and suffering in the lives of many a data engineer and data warehouse developer. Apparently some things don’t change with the passage of time. My first job in tech was working on a data warehouse team with a classic Kimball-style model on SQL Server; back then, worrying about how to make data loads and ETL idempotent was the task of the hour. All these years later, working on data lakes in Databricks with Spark … guess what … still worrying about idempotent ETL and data loads.

May 24, 2021

Data, Data Engineering, Data Warehousing

Data Modeling in DeltaLake (DataBricks)

Time to open a can of worms. I’ve recently been working with Databricks, specifically DeltaLake (which I wrote about here). DeltaLake is an amazing tool that, when paired with Apache Spark, is like the juggernaut of Big Data. The old is new, the new is old. The rise of Databricks and DeltaLake is proof that the age-old need for classic Data Warehousing/Data Lakes is as strong as ever. While this Spark+DeltaLake tech stack is amazing, it’s not your Grandma’s data warehouse, it’s fundamentally different under the hood. One of the topics I’ve been thinking about lately has been data modeling in DeltaLake (on Databricks or not).

May 10, 2021

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

The 3 Types of Data Engineers, Which One Are You?

Every good story starts with a few different characters, right? It’s like the spice of life, a little bit of this, a little bit of that. It’s the way of the world. In all my data wandering I’ve come across lots of different types of data engineers. I can usually put them into three different categories, somewhat similar but in many ways quite different.

April 7, 2021
