Ramblings Archives - Confessions of a Data Guy

What is the WordPress drama about?

I figured a few of us might need the WordPress drama explained like we are 5. So, here you go.

WordPress is the GOAT of internet website builders
WordPress was founded by Matt Mullenweg
With much of the internet running on WordPress … hosting WordPress is of course … lucrative and a big business.
The founder of WordPress, Matt Mullenweg, is CEO of a company called Automattic
- which runs WordPress.com, including hosting
WPEngine is the other big gorilla in the WordPress space.
- hosting platform etc.
There is a lot of money involved
Mullenweg was/is unhappy with WPEngine
- went after WPEngine for being private-equity owned, for disabling WordPress features to “save money,” and for confusing consumers about the WordPress trademark and what is official and what isn’t.
The fight turned very public, and now lawsuits are flying back and forth
The fight is also spilling over into the open-source community, as there are myriad developers and businesses who’ve built their companies around WordPress.

It reminds me of the Rust trademark hoopla. The whole thing has quickly devolved into what is supposed to be “open-source” software being controlled by money-hungry interests who lay claim to trademarks and other “stuff” surrounding brands, and who then start telling tons of developers and companies (who’ve been happily doing things for years) that they are now all subject to x, y, z and we will sue you and destroy you if you don’t comply.

People take sides, and the open-source world and all the “things” attached to the “thing” in question descend into chaos.

October 14, 2024

Data, Data Engineering, Ramblings

How to make the PERFECT Pull Request (PR)

Is there anything worse than the PR process (Pull Request) at most companies? Probably not. It’s the dreaded 600-pound gorilla in the room that no one wants to talk about. Everyone hates it, everyone has to do it. But, it doesn’t have to be like that.

There are a few tried and true ways to make the perfect PR that takes all your problems away. Check out the video for more.

October 12, 2024

Data, Data Engineering, Ramblings

Hosted (SaaS) vs DIY Data Tools

I’ve been hacking around with tools and programming since Perl was a thing. I’ve run the gamut of Data Platforms from large organizations to tiny startups, and all those in between. I’ve worked on Data Platforms that dropped ungodly amounts of money on SAP products, and at places where we would build our own massive data processing platforms on Kubernetes.

Each to their own I guess.

October 3, 2024

AI, Big Data, Data, Data Engineering, Ramblings

AI (LLMs) and Software Engineering (Writing Code)

I recently wrote on my Substack (Data Engineering Central) about how I used the new OpenAI o1 model to do some basic Data Engineering tasks surrounding PostgreSQL. It did ok. I’ve also been using CoPilot and ChatGPT for over a year now to assist me with my daily code that I have to write for one reason or another.

September 24, 2024

Data, Data Engineering, Ramblings

What is a “Good” Data or Software Engineer?

Recently, for some unknown reason, I was perusing the new Stackoverflow … called Reddit, for Data Engineering … and I ran across an interesting question … more or less it was related to “what makes a good Software Engineer … in a Data Engineering context.”

August 20, 2024

Big Data, Data, Data Engineering, Ramblings

How to Solve Data Engineering Problems

One thing I find myself doing these days (I am unsure how I feel about this), is teaching others to solve problems … Data Engineering problems to be specific. It’s not a hard stretch for most to imagine that what a person does at Senior+ software-type levels is just write good code all day.

I assure you, this is not the case typically.

August 7, 2024

Big Data, Data, Data Engineering, Ramblings

The Abstractions Are Making You Dumb (rise of the Shallow Expert)

When I was young and full of myself, writing Perl and PHP, while your ma was still reading you a bedtime story and giving you a stuffy to fall asleep with, I had to program uphill, both ways, in the rain and snow. Not like you milquetoast Data Engineers clickety-clicking around Databricks and Snowflake UIs.

You want a server? Spin up your own Apache. Need a database? MySQL was the only game in town. Need a backend language? Perl was the cat’s meow.

July 10, 2024

Data Warehousing, Ramblings

Databricks Buys Tabular – 1 Billion Dollar Deal. Iceberg vs Delta Lake?

The battle for the Data Warehouse, Data Lake, Lake House, or whatever you want to call it, in the age of AI just got more interesting. In an unsurprising move, Databricks has announced plans to buy Tabular for 1 billion dollars, beating out Snowflake who was reportedly trying to do the same thing.

June 4, 2024

Big Data, Data, Data Engineering, Ramblings

Building Data Platforms (from scratch)

The regular humdrum of business and work for Data Engineers is usually filled with the same old, same old. Build new pipeline, update pipeline, new data model, fix bug, etc, etc. It’s never-ending. It’s a constant stream of data, new and old, spilling into our Data Warehouses and Lake Houses with the constant bubbling of some spring-time swollen stream.

What doesn’t happen that often is the infamous “Greenfield Project.” Otherwise known as new stuff … like brand new stuff. Like we don’t have this stuff and now we want this stuff.

I think it’s safe to say there are Greenfield Projects, and then there are Greenfield Projects. It can be hard to know what some people mean, or what their definition of new actually is. Probably just depends on the Data Team, but for the purposes of this discussion let’s assume it means something slightly different … aka you’re building an entirely new, end-to-end, Data Platform … and all that it entails.

Building Data Platforms from scratch.

This is indeed probably the most fun a Data Engineer can have, and the most stressful thing a Data Engineer can do. It’s both. You will learn the most, and cry the most. It will turn a milquetoast Engineer into a real one. No hiding. It will either work or it won’t, and since your customers will be Engineers, you will find out right away what you got right, and what you got wrong.

At a high level.

When you’re building something brand new from scratch (yes I’ve done this a few times), it’s easy to get caught up in the details, but this is a bad idea. There is a mantra you should repeat to yourself in the beginning, to keep from getting stuck in the weeds.

“I must understand the big picture and put the pieces together, do my due diligence, there will be bumps in the road, don’t stress the details.”

The classic gotcha I see a lot of younger Engineers make is to think that “it’s all about the code.” They simply cannot get past “but how will we make this one thing work like this.” They get utterly consumed by the code.

Experienced Engineers understand that the code will always work itself out. Engineers WILL find solutions to problems; that’s what they get paid to do. So trying to pre-solve coding problems before they exist will ensure you focus on the wrong things and forget the important things.

At the same time, I fully believe that these projects are best done by Senior+ Engineers and not out-of-touch Architects.

Here is where to start when building Data Platforms from scratch. Of course, there is a lot of hidden detail here, but the point is to break up the building of brand new Data Platforms into manageable parts.

Let’s talk about this a little.

I know I’m beating the proverbial horse to death, but the first thing on the list is important, the order of operations in the above chart. Typically what the Engineers want to do is show up with whatever their bright and shiny solution is, before they even really have documented the needs and requirements. So annoying.

Gather Requirements
Pick your tech and do the Pros and Cons.
- Review this with the team(s)
Sketch out a rough design and do POC of tech if necessary.
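One way to keep the “Pros and Cons” step honest is to write the comparison down as a small weighted scorecard that the team can argue over. The snippet below is only a rough sketch of that idea: the requirement names, weights, scores, and the `option_a` / `option_b` candidates are all made up for illustration, and a simple weighted sum is an assumption on my part, not the only way to compare tech.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One technology option, scored 1-5 against each requirement."""
    name: str
    scores: dict[str, int]


def rank_candidates(candidates, weights):
    """Return (name, weighted_score) pairs, best first."""
    ranked = []
    for candidate in candidates:
        # Refuse to rank an option that was never scored on a requirement.
        missing = set(weights) - set(candidate.scores)
        if missing:
            raise ValueError(
                f"{candidate.name} is missing scores for: {sorted(missing)}"
            )
        total = sum(weights[req] * candidate.scores[req] for req in weights)
        ranked.append((candidate.name, total))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical requirements and weights (higher weight = matters more).
WEIGHTS = {"cost": 3, "team_familiarity": 2, "scales_to_volume": 5}

# Hypothetical scores; replace with what your own review and POC turn up.
CANDIDATES = [
    Candidate("option_a", {"cost": 2, "team_familiarity": 5, "scales_to_volume": 4}),
    Candidate("option_b", {"cost": 4, "team_familiarity": 2, "scales_to_volume": 3}),
]

RANKING = rank_candidates(CANDIDATES, WEIGHTS)
for name, score in RANKING:
    print(f"{name}: {score}")
```

The numbers matter less than the conversation they force: if the team can’t agree on the weights, the requirements probably aren’t gathered yet.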

The funny part about these few steps is that they are probably the most neglected. Heck, I’m the same way, I just want to jump in with both feet and do something fun. Making sure requirements are gathered and understood, and that you’ve carefully examined the ins and outs of all the tech, can be very tedious.

Who wants to read through mountains of documentation and make lists of this and that, double-checking and cross-checking features, etc? Well, I’m sure there are a few weirdos out there who do like it. Maybe I do a little. Don’t tell anyone.

If you’re going to skip something …

So, if you’re just a normal human who likes to take shortcuts and refuses to do what you’re told, and you’re going to skip this pre-work, I would suggest that you AT LEAST do the “Sketch out the rough design and POC if necessary.”

Drawing your new Data Platform out on a piece of paper, even the details, will make you think about what you’ve missed, showing you visually what the system is supposed to look like, and bringing out any problem areas.

What if I don’t know where to start?

So what if you’ve never really built a Data Platform from scratch and you don’t even know what you don’t know, where to start, what to do, what to care about … I mean what are the baseline things a Data Platform should consist of and provide?

Never fear, I’ve stumbled and stubbed my toe enough to give you a little insight.

Monitoring and Alerting
CI/CD & DevOps
Orchestration and Dependency Management
Compute and Code
Data Storage and Models

I would say a Data Platform should, at a minimum, cover these five areas to be considered complete and be able to support a production-ready Data Team, a platform ready for all the problems headed its way.
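If it helps, the five areas can be treated as a checklist that a draft design either covers or doesn’t. The sketch below is a minimal illustration of that, nothing more: the `draft_design` dictionary and its tool names are placeholders I made up, and “covered” here simply means a non-empty tool choice has been written down for the area.

```python
# The five baseline areas named in the list above.
REQUIRED_AREAS = (
    "Monitoring and Alerting",
    "CI/CD & DevOps",
    "Orchestration and Dependency Management",
    "Compute and Code",
    "Data Storage and Models",
)


def missing_areas(design):
    """Return the required areas with no tool chosen yet, in list order."""
    return [area for area in REQUIRED_AREAS if not design.get(area)]


# Hypothetical draft design: tool names are placeholders for illustration only.
draft_design = {
    "Orchestration and Dependency Management": "some_scheduler",
    "Compute and Code": "some_compute_engine",
    "Data Storage and Models": "",  # not decided yet
}

gaps = missing_areas(draft_design)
for area in gaps:
    print(f"No decision yet for: {area}")
```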

I’m not going to dive into depth on each of these topics. I think merely naming them out loud should be enough for most Data Engineers to get on board with what a Data Platform should consist of. Each one is worthy of a book in its own right.

Monitoring and Alerting
- No Data Platform is complete without the ability to visually monitor the data and pipelines underneath, but even more importantly the ability to Alert and send notifications on failures and successes. This tooling will have to integrate well into both the Compute and Code and Orchestration tooling choices.
  - This probably includes Data Quality (a major project in itself)
CI/CD & DevOps
- Most often forgotten by hobbits, there needs to be continuous and automated deployment of infrastructure and code, including testing, dev and prod environments, etc. Extremely important for Developer happiness and quick and solid cycle times.
Orchestration and Dependency Management
- The first major decision that will have a huge impact on the entire Data Platform. What tool will be used to schedule and build the orchestration of all the data and data pipelines inside the Platform? It will probably need to integrate into Monitoring and Alerting as well as the Compute platform. Take your time and choose wisely.
Compute and Code
- The second major decision is the core data crunching decision, and of course, will have the single biggest impact on how the code and pipelines will look. Are you going to choose Databricks, Snowflake, Redshift, BigQuery, etc? This will be driven by requirements and the data itself.
Data Storage and Models
- Another often-forgotten aspect of the Data Platform that doesn’t get enough focus off the bat is the storage and data model. Again, this should be considered in minute detail: the data layouts, types, volume, size, and what kind of features are expected and needed. Delta Lake, Hudi, Iceberg, raw files on S3? Decisions have serious consequences.
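To make the link between Orchestration and Monitoring and Alerting a little more concrete, here is a toy sketch using only the Python standard library. It is not how any particular orchestrator works: the `run_pipelines` function, the task names, and the `notify` callback are all hypothetical, and a real platform would hand this job to a proper scheduling tool. The point is just the shape of it: run things in dependency order, and send a notification on every success, failure, or skip.

```python
from graphlib import TopologicalSorter


def run_pipelines(dependencies, tasks, notify):
    """Run tasks in dependency order, notifying on every outcome.

    dependencies maps a task name to the set of task names it waits on.
    Tasks downstream of a failed or skipped task are skipped, not run.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(results[dep] != "success" for dep in upstream):
            results[name] = "skipped"
            notify(name, "skipped")
            continue
        try:
            tasks[name]()
        except Exception as exc:
            results[name] = "failed"
            notify(name, f"failed: {exc}")
            continue
        results[name] = "success"
        notify(name, "success")
    return results


def fail_transform():
    # Stand-in for a pipeline step that blows up.
    raise RuntimeError("bad schema")


# Hypothetical three-step pipeline: ingest -> transform -> publish.
DEPENDENCIES = {
    "ingest": set(),
    "transform": {"ingest"},
    "publish": {"transform"},
}
TASKS = {
    "ingest": lambda: None,
    "transform": fail_transform,
    "publish": lambda: None,
}

alerts = []  # stand-in for a real notification channel
results = run_pipelines(
    DEPENDENCIES, TASKS, lambda name, status: alerts.append((name, status))
)
print(results)
print(alerts)
```

Even in a toy like this you can see why the orchestration choice and the alerting choice can’t be made in isolation: the scheduler is the thing that knows what failed and what got skipped because of it.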

If you stop and think about each one of these, they are almost a project in themselves, as they should be. Building Data Platforms from scratch is not easy. Each piece connects to and integrates with the other ones, each decision you make on what to use in each area will affect the other areas, the complexity, and how well the Platform runs at scale when built.

AND, we haven’t really talked about the code itself and the specific business use cases that might drive certain decisions based on how your Platform needs to deal with “special” situations specific to your data. The reality is that each business usually has these sorts of caveats.

May 30, 2024

Python, Ramblings

Google Fires Python. What Next?

What is going on? Is the world coming to an end? I thought Python was going to live forever. Well, apparently not at Google. Recently Google announced it was laying off its entire North American-based Python team that was supporting Google’s special needs with Python, in favor of cheaper offshore workers.

Apparently, some of these Engineers were GOAT-level employees.

Is that really that much of a surprise? Probably not. Ever since Bard, Google has been failing hard. They still make a ton of money, but old Google has gone bye-bye. It’s a new Era of Corporate Google, and it’s here to stay.

Heck, you can still make bank working in Software at Google … but you better save some of those greens for when your name gets called.

April 30, 2024
