Big Data Archives - Page 5 of 11 - Confessions of a Data Guy

Introduction to Designing Data Load Patterns.

When I think back many moons ago, to when I started in Data Engineering world … even though it went by many different names back in the olden days … I didn’t know what I didn’t know. All those years ago Kimball’s Data Warehouse Toolkit was probably the only resource really available at the time that touched on the general concepts that most “Data Engineers” at the time were working on. The field has come a long way since those days and changed for the better, it’s less often you see classic Data Warehouses running on legacy SQL Servers, with stored procedures with hundreds and thousands of lines of SQL code.

That had me thinking about designing data load patterns in the Modern Data Stack. I want to talk about general data loading patterns, how to design your data pipelines, at a high level, and the basic principles and practices that apply to 99% of all the transformations and data loads done by most Data Engineers.

September 27, 2022

Big Data, Data, Data Engineering, Machine Learning, Ramblings

Machine Learning from the viewpoint of an average Data Engineer.

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz has quieted down over the last few years about ML and MLOps, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why this is the case and comparing my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees, and with a handful of employees. The problems and hurdles at the same across the board, and mostly everyone is not very good at it.

September 26, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Soda-Core. Data Quality at Scale.

Ever since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools, not enough options, hard to use, and the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…

“Data reliability testing for SQL- and Spark- accesssible data.“
soda-core docs

Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics were hear a lot about these days. The reality often doesn’t meet the expectations most of the time. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core pair with Spark can show us some real promise. Code available on GitHub.

September 14, 2022

Big Data, Data, Data Engineering, Data Warehousing, Rust

DataFusion courtesy of Rust, vs Spark. Performance and other thoughts.

I think it’s funny that DataFrames are so popular these days, I mean for good reason. They are a wonderful and intuitive way to work with and on datasets. Pandas … the nemesis of all Data Engineers and the lover of Data Scientists. Apache Spark is really the beast that brought DataFrames to the masses. Even those little buggers over at Apache Beam give you DataFrames.

Of course, when anything gets popular, you start getting little things that start to pick and peck at the heels. I would probably say that is what DataFusion with Rust seems to be. Seems more like a contender against Pandas rather than Spark to me. I guess if you’re just using Spark locally or on a single node, sure you could consider using DataFusion. Code available on GitHub.

September 11, 2022

Big Data, Data, Data Engineering, Ramblings

Real-Life Example of Big O(n) Notation (and other such nonsense) for Data Engineering.

In the beginning, I always thought the humdrum Big O Notation discussions should be reserved for Software Engineers who enjoyed working on such things. I mean, what could it possibly have to do with Data Engineering? I mean, if you are the person writing the Spark application, by all means, have at it, but if you are the Data Engineer who is simply using Spark, why can’t you just leave the details to the Devil? Seems to make sense.

The only problem with that logic is the longer you work as a Data Engineer, probably the harder the problems you work on become, you write more and more code, and basically end up being a specialized Software Engineer … even if you don’t want to be. In the end, to be a good Data Engineer you should at least attempt to understand the concepts behind Big O Notation, and how those concepts can apply to you as Data Engineer, especially for the ETL that most of us write.

August 15, 2022

Big Data, Data, Data Engineering, Data Quality, SQL

You Only Need 2 Data Validations, That’s It.

I mean, I’m sort of being facetious and sort of not. I mean there is some truth that rings out in those words. I’m sure someone selling Data Observability tools, or writing Great Expectations all-day will not like the idea of relying on only 2 data validations. But honestly, these two are probably more than 80% of Data Teams are using today for validation, which is none. What 2 are you? Glad you asked.

August 1, 2022

Big Data, Data, Data Engineering, Data Warehousing

Exploring Delta Lake’s ZORDER, and Performance. On Databricks.

I think Delta Lake is here to stay. With the recent news that Databricks is open-sourcing the full feature-set of Delta Lake, instead of keeping the best stuff for themselves, it probably has the most potential to be the number one go-to for the future of Data Lakes, especially within those organizations that are heavy Spark users.

One of the best parts about Delta Lake is that it’s easy to use, yet it has a rich feature set, making it a powerful option for Big Data storage and modeling. One of those features that promise a lot of performance benefits is something called ZORDER. Today I want to explore more in-depth what ZORDER is, when to use it, when not to use it, and most importantly test its performance during a number of common Spark operations.

July 7, 2022

Big Data, Data, Data Engineering, Python

A Few Wonderful PySpark Features.

Just when I think it cannot get more popular, it does. I have to admit, PySpark is probably the best thing that ever happened to Big Data. It made what was once a myth, approachable to the average person. No need for esoteric Java skills, no more MapReduce, just plain old Python. Another amazing thing about Spark in general, and by extension PySpark, is the sheer amount of out-of-the-box capabilities. I wanted to dedicate this post to a few amazing and wonderful features of PySpark that make Data Engineering fun and powerful.

June 24, 2022

Big Data, Data, Data Engineering, Ramblings

Quick Guide to Data Engineering on AWS

June 9, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

It still seems like the wild west of Data Quality these days. Tools like Apache Deque are just too much for most folks, and Data Quality is still new enough to the scene as a serious thought topic that most tools haven’t matured that much, and companies dropping money on some tool is still a little suspect. I’ve probably heard more about Great Expectations as a DQ tool than most.

With the popularity of PySpark as a Big Data tool, and Great Expectations coming into its own, I’ve been meaning to dive into what it would actually look like to to use Great Expectations at scale and answer some simple questions. How easy is it to get up and running with Spark, what’s the path of least resistance to getting some basic Data Quality checks in place in a data pipeline.

June 2, 2022

Introduction to Designing Data Load Patterns.

Machine Learning from the viewpoint of an average Data Engineer.

Soda-Core. Data Quality at Scale.

DataFusion courtesy of Rust, vs Spark. Performance and other thoughts.

Real-Life Example of Big O(n) Notation (and other such nonsense) for Data Engineering.

You Only Need 2 Data Validations, That’s It.

Exploring Delta Lake’s ZORDER, and Performance. On Databricks.

A Few Wonderful PySpark Features.

Quick Guide to Data Engineering on AWS

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

Interesting links

Pages

Categories

Archive