Data Engineering Archives - Page 11 of 23 - Confessions of a Data Guy

Data, Data Engineering, Data Quality, Ramblings

A Diatribe against Data Contracts and their Abuses.

Ok, so I don’t really mean all that. Or do I? I have no idea what the future holds. Sometimes it’s easy to pick out the winners, like Databricks and Snowflake, you can see, feel, and taste the results of those data products, a delicious and delectable bounty to feast upon. Other things are harder to read the tea leaves on. Kinda like Data Mesh … is it a thing, or is it not a thing? It’s hard to decern between charlatans and marketing/sales departments hocking the next Cure All Snake Oil and real life.

What about all this recent humdrum and buzz around Data Contracts? Pushed by some popular Data Engineering faces like Ananth Packkildurai and Chad Sanderson. What is all the hype about Data Contracts, are folks just pushing another tool down our throats? Is there a real issue and problem that can be solved with Data Contracts?

November 16, 2022

Big Data, Data, Data Engineering, Data Warehousing

Introduction to Historical Loads – for Data Engineers.

There are probably few things in life that will strike more fear and tumult in the heart of the Data Engineer than historical loads. You know, on the surface it seems like such an innocent thing. How could it possibly be, just take a bunch of data stored somewhere and shove it into a table. If only. Life never works that way, and neither does the historical load. You would think after decades we all would have figured it out you know. Is it because we don’t do it enough? Maybe it’s like regex, you just figure it out as you go every single time, telling yourself you’ll do it right next time.

November 5, 2022

Big Data, Data, Data Engineering, Data Warehousing, Rust

Delta Lake without Spark (delta-rs). Innovation, cost savings, and other such matters.

The intersection of Big Data and Not Big Data.

An interesting topic of late that has been rattling around in my overcrowded head is the idea of Big Data vs Not Big Data, and the intersection thereof. I’ve been thinking about SAAS vendors, the Modern Data Stack, costs, and innovation. A great real-life example of all these topics is Delta Lake. Delta Lake is the child of Databricks, officially or not, and at a minimum has exploded in usage because of the increasing usage of Databricks and the popularity of Data Lakes.

Delta Lake, Hudi, Iceberg, all these ACID/CRUD abstractions on top of storage for Big Data have been game changers. But, as with any new popular tech, it comes with its own set of challenges. Specifically for Delta Lake … if you want to use it 99.9% of people are going to have to use Spark to do so, which can be costly, in terms of running clusters, and add complexity, in terms of new tooling, data pipelines, and the like. Anytime you only have one path to take with a tool, innovation is stifled, and barriers arise. Enter delta-rs the Standalone Rust API for Delta.

October 25, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

New to PySpark? Do this, not that.

Do this, not that. Well, I’ve got my own list. With everyone jumping on the PySpark / Databricks / EMR / Glue / Whatever bandwagon I thought it was long overdue for a post on what to do, and not to do when working with Spark / PySpark. I take the pragmatic approach to working with Spark, it’s honestly very forgiving well and far into the 10s of TBs of data. Once you wander past that point things tend to get a little spicy if you don’t have it all dialed in. As with most things in life if you get a few things right, and of course don’t do some things, that will get you a long way, the same applies to Spark.

October 13, 2022

Data, Data Engineering, Ramblings

5 Years of Blogging – Most Popular Articles, Traffic Stats, and other Thoughts.

Sometimes I feel like I’ve been doing this too long, life gets busy, and I don’t have much to say … but here I am 5 years later. I’m still making people mad and making a fool of myself, some things never change. This will probably be short and sweet. I will cover the top 10 most popular blog posts from those 5 years, what the traffic has looked like over time, and what I’ve learned from writing blogs for so long, the good, the bad, and the ugly.

October 7, 2022

Big Data, Data, Data Engineering, Data Warehousing, SQL

Introduction to dbt … for Data Engineers.

So, you’ve heard about dbt have you. I honestly can’t decide if it’s here to stay or not, probably is, enough folks are using it, and preaching about it. I personally have always been a little skeptical of dbt, not because it can’t do what it says it can do, it can, but because I’m old and bitter from my many years of Data Engineering, and I always see the problems in things.

But, I will let you judge that for yourself. Today I want to give a brief overview of dbt, kick the tires, muse about its features, and most importantly, look at dbt from a Data Engineering perspective, ferret out the good, the bad, and the ugly. I will try my best to be nice but don’t count on it. Code is on GitHub.

October 4, 2022

Big Data, Data, Data Engineering

Introduction to Designing Data Load Patterns.

When I think back many moons ago, to when I started in Data Engineering world … even though it went by many different names back in the olden days … I didn’t know what I didn’t know. All those years ago Kimball’s Data Warehouse Toolkit was probably the only resource really available at the time that touched on the general concepts that most “Data Engineers” at the time were working on. The field has come a long way since those days and changed for the better, it’s less often you see classic Data Warehouses running on legacy SQL Servers, with stored procedures with hundreds and thousands of lines of SQL code.

That had me thinking about designing data load patterns in the Modern Data Stack. I want to talk about general data loading patterns, how to design your data pipelines, at a high level, and the basic principles and practices that apply to 99% of all the transformations and data loads done by most Data Engineers.

September 27, 2022

Big Data, Data, Data Engineering, Machine Learning, Ramblings

Machine Learning from the viewpoint of an average Data Engineer.

I’ve been thinking more about the topic of ML and MLOps lately. To me, it seems like the buzz has quieted down over the last few years about ML and MLOps, at least somewhat, in favor of other topics like Data Quality, Data Lakes, Data Contracts, and the like. I’ve been wondering why this is the case and comparing my experience over the last few years of working in, on, and around ML pipelines and systems. I’ve seen ML done at companies with a few thousand employees, and with a handful of employees. The problems and hurdles at the same across the board, and mostly everyone is not very good at it.

September 26, 2022

Data, Data Engineering, Golang

Gandalf’s Beard! DataFrames in Golang.

Photo by Thomas Schweighofer on Unsplash

I’m not sure if DataFrames in Golang were created by Gandalf or by Saruman, it is still unclear to me. I mean, if I want a DataFrame that bad … why not just use something normal like Python, Spark, or pretty much anything else but Golang. But, I mean if Rust gets DataFusion, then Golang can’t be left out to dry, can it!? I mean I guess if you’re hardcore Golang and nothing else will do, and you’re playing around with CSV files, then maybe? Seems like kind of a stretch. But, I have a hard time saying no to Golang, it’s just so much fun. Kinda like when Gandalf told them little hobbits and dwarfs to not stray from the path going through Fangorn Forest, those little buggers did it anyways. Code available on GitHub.

September 22, 2022

Data, Data Engineering, Ramblings

8 Data Engineering Best Practices

Best practices are always a touchy subject, I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, keeping things simple. I also ascribe to the 80/20 rules, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well, in the long run will probably solve most of your major problems encountered in data teams and architectures. Today I want to give you 8 Data Engineering best practices to hopefully give you some food for thought at least.

September 21, 2022

A Diatribe against Data Contracts and their Abuses.

Introduction to Historical Loads – for Data Engineers.

Delta Lake without Spark (delta-rs). Innovation, cost savings, and other such matters.

The intersection of Big Data and Not Big Data.

New to PySpark? Do this, not that.

5 Years of Blogging – Most Popular Articles, Traffic Stats, and other Thoughts.

Introduction to dbt … for Data Engineers.

Introduction to Designing Data Load Patterns.

Machine Learning from the viewpoint of an average Data Engineer.

Gandalf’s Beard! DataFrames in Golang.

8 Data Engineering Best Practices

Interesting links

Pages

Categories

Archive