Of all the duties that Data Engineers take on during the regular humdrum of business and work, most of it is the same old, same old. Build a new pipeline, update a pipeline, add a new data model, fix a bug, etc., etc. It’s never-ending. It’s a constant stream of data, new and old, spilling into our Data Warehouses and Lake Houses with the constant bubbling of a spring-time swollen stream.

What doesn’t happen that often is the infamous “Greenfield Project.” Otherwise known as new stuff … like brand new stuff. Like we don’t have this stuff and now we want this stuff.

I think it’s safe to say there are Greenfield Projects, and then there are Greenfield Projects. It can be hard to know what some people mean, or what their definition of new actually is. Probably just depends on the Data Team, but for the purposes of this discussion let’s assume it means something slightly different … aka you’re building an entirely new, end-to-end, Data Platform … and all that it entails.

Building Data Platforms from scratch.

This is indeed probably the most fun a Data Engineer can have, and the most stressful thing a Data Engineer can do. It’s both. You will learn the most, and cry the most. It will turn a milquetoast Engineer into a real one. No hiding. It will either work or it won’t, and since your customers will be Engineers, you will find out right away what you got right, and what you got wrong.

At a high level.

When you’re building something brand new from scratch (yes I’ve done this a few times), it’s easy to get caught up in the details, but this is a bad idea. There is a mantra you should repeat to yourself in the beginning, to keep from getting stuck in the weeds.

“I must understand the big picture and put the pieces together, do my due diligence; there will be bumps in the road, don’t stress the details.”

The classic gotcha I see a lot of younger Engineers fall into is thinking that “it’s all about the code.” They simply cannot get past “but how will we make this one thing work like this?” They get utterly consumed by the code.

Experienced Engineers understand that the code will always work itself out; Engineers WILL find solutions to problems, that’s what they get paid to do. Trying to pre-solve coding problems before they exist will ensure you focus on the wrong things and forget the important ones.

At the same time, I fully believe that these projects are best done by Senior+ Engineers and not out-of-touch Architects. 

Here is where to start when building Data Platforms from scratch. Of course, there is a lot of hidden detail here, but the point is to break up the building of brand new Data Platforms into manageable parts.

Let’s talk about this a little.

I know I’m beating the proverbial horse to death, but the first thing on the list is important: the order of operations in the list below. Typically what Engineers want to do is show up with whatever their bright and shiny solution is before they have even really documented the needs and requirements. So annoying.

  • Gather Requirements.
  • Pick your tech and do the Pros and Cons.
    • Review this with the team(s).
  • Sketch out a rough design and do a POC of the tech if necessary.

The funny part about these few steps is that they are probably the most neglected; heck, I’m the same way, I just want to jump in with both feet and do something fun. Making sure requirements are gathered and understood, and that you’ve carefully examined the ins and outs of all the tech, can be very tedious.

Who wants to read through mountains of documentation and make lists of this and that, double-checking and cross-checking features, etc? Well, I’m sure there are a few weirdos out there who do like it. Maybe I do a little. Don’t tell anyone.

If you’re going to skip something …

So, if you’re just a normal human who likes to take shortcuts and refuses to do what you’re told, I would suggest that if you do skip this pre-work, you should AT LEAST do the “Sketch out the rough design and POC if necessary” step.

Drawing your new Data Platform out on a piece of paper, even without all the details, will make you think about what you’ve missed, show you visually what the system is supposed to look like, and bring out any problem areas.

What if I don’t know where to start?

So what if you’ve never really built a Data Platform from scratch and you don’t even know what you don’t know, where to start, what to do, what to care about … I mean what are the baseline things a Data Platform should consist of and provide?

Never fear, I’ve stumbled and stubbed my toe enough to give you a little insight.

  • Monitoring and Alerting
  • CI/CD & DevOps
  • Orchestration and Dependency Management
  • Compute and Code
  • Data Storage and Models

I would say, at a minimum, a Data Platform should cover these five areas to be considered complete and able to support a production-ready Data Team, a platform ready for all the problems headed its way.

I’m not going to dive into depth on each of these topics; I think merely naming them out loud should be enough for most Data Engineers to get on board with what a Data Platform should consist of. Each one is worthy of a book in its own right.

  • Monitoring and Alerting
    • No Data Platform is complete without the ability to visually monitor the data and the pipelines underneath it, and, even more importantly, the ability to alert and send notifications on failures and successes. This tooling will have to integrate well with both the Compute and Code and the Orchestration tooling choices.
      • This probably includes Data Quality (a major project in itself)
  • CI/CD & DevOps
    • Most often forgotten by hobbits, there needs to be continuous and automated deployment of infrastructure and code, including testing, dev and prod environments, etc. Extremely important for Developer happiness and for quick, solid cycle times.
  • Orchestration and Dependency Management
    • The first major decision, and one that will have a huge impact on the entire Data Platform. What tool will be used to schedule and build the orchestration of all the data and data pipelines inside the Platform? It will probably need to integrate with Monitoring and Alerting as well as the Compute platform (see the sketch after this list). Take your time and choose wisely.
  • Compute and Code
    • The second major decision is the core data crunching decision, and of course, will have the single biggest impact on how the code and pipelines will look. Are you going to choose Databricks, Snowflake, Redshift, BigQuery, etc? This will be driven by requirements and the data itself.
  • Data Storage and Models
    • Another often-forgotten aspect of the Data Platform that doesn’t get enough focus right off the bat is storage and the data model. Again, this should be considered in minute detail: the data layouts, types, volume, size, and what kind of features are expected and needed. Delta Lake, Hudi, Iceberg, raw files on s3? Decisions have serious consequences.
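
To make the Orchestration and Monitoring/Alerting point concrete, here’s a minimal sketch of how those two pieces might tie together, assuming Apache Airflow as the orchestrator and a hypothetical Slack webhook URL for notifications. Nothing here is prescribed; it just shows how the pieces integrate.

```python
# A minimal sketch of Orchestration + Monitoring/Alerting working together.
# Assumes Apache Airflow as the orchestrator; the webhook URL and pipeline
# names are hypothetical.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # hypothetical


def alert_on_failure(context):
    # Airflow passes in task context whenever a task fails.
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Pipeline failure: {ti.dag_id}.{ti.task_id}"},
        timeout=10,
    )


def extract():
    ...  # pull raw data into storage (s3, Delta Lake, etc.)


def transform():
    ...  # crunch the data on your chosen Compute


with DAG(
    dag_id="daily_sales_pipeline",  # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": alert_on_failure,  # the Monitoring hook
    },
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Dependency Management: transform only runs after extract succeeds.
    extract_task >> transform_task
```

The point isn’t the specific tool; it’s that the Orchestration choice has tentacles into Alerting, retries, and dependencies, which is exactly why it deserves so much care up front.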

If you stop and think about each one of these, they are almost projects in themselves, as they should be. Building Data Platforms from scratch is not easy. Each piece connects to and integrates with the others; each decision you make about what to use in one area will affect the other areas, the complexity, and how well the Platform runs at scale once built.

AND, we haven’t really talked about the code itself and the specific business use cases that might drive certain decisions based on how your Platform needs to deal with “special” situations specific to your data. The reality is that each business usually has these sorts of caveats.

If you’ve ever been in the market for a Data Engineering job, or you’re alive and on LinkedIn, you’ve probably been constantly inundated with job postings and requests, pounding on your inbox like a constant mountain stream ever bubbling down a hill.

If that’s not the case, then head over to the quarterly salary discussion on r/dataengineering and cruise around the comments for an hour or two. One thing that will become clear quickly is that there is a huge range in pay scales, for apparently the same jobs.

One person is making 190K base and another on the same stack is making 80K; it’s enough to make you pull your hair out. Heck, I get a constant stream of messages for “great opportunities” for a good 50K less than I make at this moment.

What’s the deal?

Sure, there is always going to be some disparity depending on the individual Data Engineer; some people simply deliver twice as much as others and are compensated accordingly. Some have over a decade of experience, some only a few years.

Straight to the point.

I will tell you what the deal is. It’s the companies.

The truth is that this sort of pay disparity exists all across the board, not only in Data Engineering, it’s a human phenomenon, and the Data field is no exception. Companies are simply different and act differently towards employees. Sometimes you’re a widget no one cares about, easily replaceable, and sometimes you end up in a place where people are core to what happens, and are paid accordingly.

  • Companies that pay well below market value don’t care and probably have bad cultures.
  • Companies that pay at or above market value are investing in people and have good cultures.
  • Companies that don’t care about data don’t pay well and have crappy infrastructure and tooling.
  • Companies that care about data pay well and have great infrastructure and tooling.

A lot of it boils down to two things.

  1. Does this company care about people generally, or not so much? Do they invest in people or see them as liabilities that are easily replaceable?
  2. Does this company care about “data” in a real way? Do they invest in their data?

If you’re a Data Engineer you should find a place to work that cares about people and data, the perfect cross-section. This will maximize your earnings. Although, as Data Engineers, we should not complain as a whole; generally speaking, salaries are on average more than enough to live a good life on.

Overall, data engineers can expect to earn between $96,673 and $130,026 on average annually, with higher earnings possible in certain locations and for those with specialized skills and experience. – AI

How to make more money.

Do you want to make more money, are you underpaid? The answer is simple. Find a new job.

Humans don’t like change, many people get stuck and afraid to move on, and this is what keeps them operating in poor cultures where they are treated poorly, overworked, and underpaid. I’m here to tell you data is still the new oil, even more so with the rise of AI.

You can double your income simply by changing jobs every 1.5 years or so.

Nothing will give you a 20-30% pay bump quicker than changing jobs; string a few of those together and you’ve doubled your income. You can work hard, even at a good company, and only see that sort of increase after working for a decade. Don’t do it.
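
If that sounds like hyperbole, here’s the back-of-the-napkin math, with every number an assumption for illustration only:

```python
# Back-of-the-napkin compounding: job hopping vs. staying put.
# All numbers are assumptions for illustration, not data.
start = 100_000

hop = start
for _ in range(3):  # three job changes over ~4.5 years
    hop *= 1.25     # assumed ~25% bump per hop

stay = start * (1.04 ** 4.5)  # assumed ~4% annual raise, same period

print(f"hopping: ${hop:,.0f}")   # ~$195,000 -- roughly double
print(f"staying: ${stay:,.0f}")  # ~$119,000
```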

Data Engineering pays well AT GOOD Companies. Go get paid.

I’m still amazed to this day how many folks hold onto stuff they love; they just can’t let it go. I get it, sorta, I’m the same way. There are reasons why people do the things they do, even if they are hard for us to understand. It blows my mind when I see on r/dataengineering that people are still using SSIS for ETL to this day.

I guess it says something about a piece of technology, whatever it is, when it refuses to roll over and die. There’s something to that software.

Yet, even though we may take off our hats and shake the hand of Pandas, the overlord of all Python Data Folk, it still doesn’t mean we can’t move on. Show respect where respect is due, but push those able to move on, to move on into the future. When the future arrives you should consider making the switch.

Replacing Pandas with Polars.

In case you need a more detailed look into how you can actually replace Pandas with Polars, read more here. Really, today is just a philosophical bunch of thoughts about why people should move on from Pandas to Polars, why they don’t, and how to overcome those things.

Why replace?

Let’s just list some reasons.

  • Pandas is slow
  • Pandas can’t work on OOM (larger-than-memory) datasets
  • Pandas can be cumbersome
  • Pandas doesn’t have a SQL interface
  • Polars has a SQL interface
  • Polars is based on Rust (the newest and coolest thing)
  • Polars can do what Pandas can do, better
  • Other tools and general features will be more focused on Polars in the future
  • At some point, we have to accept that new things are simply better than the old thing

I mean we could go on forever.
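
To make a couple of those bullets concrete, here’s a minimal sketch, assuming a hypothetical sales.csv with region and amount columns, of the same aggregation in Pandas and in Polars. The Polars version builds a lazy query (which lets it optimize before reading anything), and the same lazy frame can be queried through Polars’ SQL interface.

```python
# A minimal Pandas vs. Polars comparison; sales.csv and its columns
# ("region", "amount") are hypothetical.
import pandas as pd
import polars as pl

# Pandas: eager, the whole file must fit in memory.
pdf = pd.read_csv("sales.csv")
pandas_result = pdf.groupby("region")["amount"].sum()

# Polars: lazy scan, the query is optimized before anything is read.
lazy = pl.scan_csv("sales.csv")
polars_result = (
    lazy.group_by("region")
    .agg(pl.col("amount").sum())
    .collect()
)

# And the SQL interface, over the very same lazy frame.
sql_result = pl.SQLContext(sales=lazy).execute(
    "SELECT region, SUM(amount) AS amount FROM sales GROUP BY region"
).collect()
```

Nothing exotic, but it shows how little ceremony the switch actually requires for everyday work.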

What are the reasons people will fail to make the switch from Pandas to Polars (some of these are symptoms of other problems)?

  • The claim that Pandas is too entrenched in the codebase
  • Unable to dedicate the time to switch
  • Afraid of the unfamiliar
  • A culture that is unable to learn new tools
  • Claim some minor “thing” they do with Pandas can’t be done in Polars (missing the forest for the trees)

I think it’s more of a cultural thing when it comes to folks who don’t want to switch from Pandas to Polars. It’s probably a business that doesn’t like change, doesn’t deal with tech debt, doesn’t focus enough on Engineering improvements, and where the status quo is never challenged.

I’m not saying that there aren’t serious hurdles when replacing pieces of technology in any stack, but that doesn’t make it a non-starter. Most good things for us are hard, in life and in Engineering. We want stability in our stack, yes, but we don’t want to be the people who are 15 years behind the changes either; we have to strike a balance.

Sometimes things come along like Polars, or Spark, or Snowflake, whatever … we can tell they are here to stay after a year or two. It’s clear the Data Engineering community and Platforms are moving in a direction, so why not move with it? No reason to stay behind and languish.

A question that comes up often … “How do I develop Production Level Databricks Pipelines?” Or maybe someone just has a feeling that using Notebooks all day long is expensive and ends up being an unreliable way to produce Databricks Spark + Delta Lake pipelines that run well … without error.

It isn’t really that hard and revolves around a few core ideas.

  • You must have a good Development Lifecycle
    • Local Development
    • Local Testing
    • Deploy to Development Environment
      • CI/CD
      • Testing
    • Deploy to the Production Environment
      • CI/CD
  • You need to use Docker and Docker Compose
    • With Spark and Delta Lake installed + whatever else.
    • Run code locally and unit test locally (see the sketch after this list).
  • You need to invest in CI/CD and automated testing and deployments
    • Nothing should be done manually; the entire process automated.
    • Learn bash and things like CircleCI or GitHub Actions
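
Since “unit test locally” can feel abstract, here’s a minimal sketch of what that looks like, assuming pyspark, delta-spark, and pytest are installed (say, in your Docker image); transform() is a hypothetical pipeline function under test.

```python
# A minimal local-testing sketch for Spark + Delta Lake pipelines.
# Assumes pyspark, delta-spark, and pytest are installed; transform()
# is a hypothetical pipeline function.
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def transform(df: DataFrame) -> DataFrame:
    # Keep pipeline logic a pure function of DataFrames so it runs
    # identically on your laptop, in CI, and on Databricks.
    return df.filter(F.col("amount") > 0).withColumn("doubled", F.col("amount") * 2)


@pytest.fixture(scope="session")
def spark():
    # A local SparkSession wired up with the Delta Lake extensions.
    builder = (
        SparkSession.builder.master("local[1]")
        .appName("unit-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    yield configure_spark_with_delta_pip(builder).getOrCreate()


def test_transform_filters_and_doubles(spark):
    df = spark.createDataFrame([(1,), (-5,), (3,)], ["amount"])
    result = transform(df).orderBy("amount").collect()
    assert [row["doubled"] for row in result] == [2, 6]
```

Run that same pytest command in CI and the “nothing manual” bullet above takes care of itself.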

Watch the video below for the full details.

Have you ever wondered about being explicit in your code vs being vague? I think about this a lot as I’m writing code on a daily basis. I’ve found I like being explicit and verbose when writing code, rather than being vague in what I’m doing most of the time.
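
As a tiny, made-up illustration of the difference (every name here is hypothetical), compare the two versions below. Both work; only one tells the reader what is going on.

```python
# A made-up illustration of vague vs. explicit code; all names are hypothetical.

# Vague: what is "d"? What does the magic 0.9 mean?
def proc(d):
    return [x * 0.9 for x in d if x > 0]


# Explicit: descriptive names, type hints, and a named constant spell out intent.
DISCOUNT_RATE = 0.9


def apply_discount_to_valid_prices(prices: list[float]) -> list[float]:
    """Drop non-positive prices, then apply the standard discount."""
    return [price * DISCOUNT_RATE for price in prices if price > 0]
```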

Read more