When I was young and full of myself, writing Perl and PHP, while your ma was still reading you a bedtime story and giving you a stuffy to fall asleep with, I had to program uphill, both ways, in the rain and snow. Not like you milquetoast Data Engineers clickety-clicking around Databricks and Snowflake UIs.
You want a server? Spin up your own Apache. Need a database? MySQL was the only game in town. Need a backend language? Perl was the cat’s meow.
Of all the duties that Data Engineers take on during the regular humdrum of business and work, most of it is the same old, same old. Build a new pipeline, update a pipeline, new data model, fix a bug, etc, etc. It’s never-ending. It’s a constant stream of data, new and old, spilling into our Data Warehouses and Lake Houses with the constant bubbling of some spring-time swollen stream.
What doesn’t happen that often is the infamous “Greenfield Project.” Otherwise known as new stuff … like brand new stuff. Like we don’t have this stuff and now we want this stuff.
I think it’s safe to say there are Greenfield Projects, and then there are Greenfield Projects. It can be hard to know what some people mean, or what their definition of new actually is. Probably just depends on the Data Team, but for the purposes of this discussion let’s assume it means something slightly different … aka you’re building an entirely new, end-to-end, Data Platform … and all that it entails.
Building Data Platforms from scratch.
This is indeed probably the most fun a Data Engineer can have, and the most stressful thing a Data Engineer can do. It’s both. You will learn the most, and cry the most. It will turn a milquetoast Engineer into a real one. No hiding. It will either work or it won’t, and since your customers will be Engineers, you will find out right away what you got right, and what you got wrong.
At a high level.
When you’re building something brand new from scratch (yes I’ve done this a few times), it’s easy to get caught up in the details, but this is a bad idea. There is a mantra you should repeat to yourself in the beginning, to keep from getting stuck in the weeds.
“I must understand the big picture and put the pieces together, do my due diligence, there will be bumps in the road, don’t stress the details.”
The classic gotcha I see a lot of younger Engineers make is to think that “it’s all about the code.” They simply cannot get past “but how will we make this one thing work like this.” They get utterly consumed by the code.
Experienced Engineers understand that the code will always work itself out; Engineers WILL find solutions to problems, that’s what they get paid to do. So trying to pre-solve coding problems before they exist will ensure you focus on the wrong things and forget the important ones.
At the same time, I fully believe that these projects are best done by Senior+ Engineers and not out-of-touch Architects.
Here is where to start when building Data Platforms from scratch. Of course, there is a lot of hidden detail here, but the point is to break up the building of brand new Data Platforms into manageable parts.
Let’s talk about this a little.
I know I’m beating the proverbial horse to death, but the first thing on the list below is important: the order of operations. Typically, what Engineers want to do is show up with whatever their bright and shiny solution is before they have even really documented the needs and requirements. So annoying.
- Gather Requirements
- Pick your tech and do the Pros and Cons.
- Review this with the team(s)
- Sketch out a rough design and do POC of tech if necessary.
The funny part about these few steps is that they are probably the most neglected; heck, I’m the same way, I just want to jump in with both feet and do something fun. Making sure requirements are gathered and understood, and that you’ve carefully examined the ins and outs of all the tech, can be very tedious.
Who wants to read through mountains of documentation and make lists of this and that, double-checking and cross-checking features, etc? Well, I’m sure there are a few weirdos out there who do like it. Maybe I do a little. Don’t tell anyone.
If you’re going to skip something …
So, if you’re just a normal human who likes to take shortcuts and refuses to do what you’re told, I would suggest that if you’re going to skip this pre-work, you should AT LEAST do the “Sketch out a rough design and POC if necessary” step.
Drawing your new Data Platform out on a piece of paper, even just the rough details, will make you think about what you’ve missed, show you visually what the system is supposed to look like, and bring out any problem areas.
What if I don’t know where to start?
So what if you’ve never really built a Data Platform from scratch and you don’t even know what you don’t know, where to start, what to do, what to care about … I mean what are the baseline things a Data Platform should consist of and provide?
Never fear, I’ve stumbled and stubbed my toe enough to give you a little insight.
- Monitoring and Alerting
- CI/CD & DevOps
- Orchestration and Dependency Management
- Compute and Code
- Data Storage and Models
I would say that, at a minimum, a Data Platform should cover these five areas to be considered complete and able to support a production-ready Data Team, a platform ready for all the problems headed its way.
I’m not going to dive into depth on each of these topics; I think merely naming them out loud should be enough for most Data Engineers to get on board with what a Data Platform should consist of. Each one is worthy of a book in its own right.
- Monitoring and Alerting
- No Data Platform is complete without the ability to visually monitor the data and pipelines underneath, but even more importantly, the ability to Alert and send notifications on failures and successes. This tooling will have to integrate well with both the Compute and Code and the Orchestration tooling choices.
- This probably includes Data Quality (a major project in itself)
- CI/CD & DevOps
- Most often forgotten by hobbits, there needs to be continuous and automated deployment of infrastructure and code, including testing, dev and prod environments, etc. Extremely important for Developer happiness and quick and solid cycle times.
- Orchestration and Dependency Management
- The first major decision that will have a huge impact on the entire Data Platform: what tool will be used to schedule and build the orchestration of all the data and data pipelines inside the Platform? It will probably need to integrate with Monitoring and Alerting as well as the Compute platform (see the sketch after this list). Take your time and choose wisely.
- Compute and Code
- The second major decision is the core data crunching decision, and of course, will have the single biggest impact on how the code and pipelines will look. Are you going to choose Databricks, Snowflake, Redshift, BigQuery, etc? This will be driven by requirements and the data itself.
- Data Storage and Models
- Another often-forgotten aspect of the Data Platform that doesn’t get enough focus off the bat is the storage and data model. Again, this should be considered in minute detail, the data layouts, types, volume, size, and what kind of features are expected and needed. Delta Lake, Hudi, Iceberg, raw files on s3? Decisions have serious consequences.
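To make that Orchestration plus Monitoring and Alerting integration a little more concrete, here is a minimal sketch of how the two pieces often meet in practice. It assumes a recent Airflow 2.x purely as an example orchestrator; the DAG, the task, and the Slack webhook below are all hypothetical.

```python
# A minimal sketch (not a production DAG): an Airflow DAG whose tasks fire a
# notification on failure, so alerting lives inside the orchestrator itself.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Called by Airflow when a task fails; push the error somewhere visible.
    message = f"Task {context['task_instance'].task_id} failed: {context.get('exception')}"
    requests.post("https://hooks.slack.com/services/XXX", json={"text": message})  # hypothetical webhook


def load_orders():
    # Placeholder for the real pipeline step (a Spark job, a Delta Lake merge, whatever).
    print("loading orders ...")


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_on_failure},
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```

The point isn’t the specific tool; it’s that failure notifications are wired into the orchestration layer from day one rather than bolted on later.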
If you stop and think about each one of these, they are almost a project in themselves, as they should be. Building Data Platforms from scratch is not easy. Each piece connects to and integrates with the other ones, each decision you make on what to use in each area will affect the other areas, the complexity, and how well the Platform runs at scale when built.
AND, we haven’t really talked about the code itself and the specific business use cases that might drive certain decisions based on how your Platform needs to deal with “special” situations specific to your data. The reality is that each business usually has these sorts of caveats.
A question that comes up often … “How do I develop Production Level Databricks Pipelines?” Or maybe someone just has a feeling that using Notebooks all day long is expensive and ends up being an unreliable way to produce Databricks Spark + Delta Lake pipelines that run well … without error.
It isn’t really that hard and revolves around a few core ideas.
- You must have a good Development Lifecycle
- Local Development
- Local Testing
- Deploy to Development Environment
- CI/CD
- Testing
- Deploy to the Production Environment
- CI/CD
- You need to use Docker and Docker-compose
- With Spark and Delta Lake installed + whatever else.
- Run code locally and unit test locally (see the sketch below).
- You need to invest in CI/CD and auto testing and deployments
- Nothing should be done manually; the entire process should be automated
- Learn bash and things like CircleCI or GitHub Actions
Watch the video below for the full details.
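In the meantime, here is a minimal sketch of the “run code locally and unit test locally” idea, assuming pyspark, delta-spark, and pytest are installed inside your Docker image. The transform and table names are hypothetical.

```python
import pytest
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip


@pytest.fixture(scope="session")
def spark():
    # Local Spark with Delta Lake enabled, exactly what you'd bake into the Docker image.
    builder = (
        SparkSession.builder.master("local[2]")
        .appName("local-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()


def test_dedupe_orders(spark, tmp_path):
    # Hypothetical transform under test: drop duplicate order_ids.
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["order_id", "value"])
    result = df.dropDuplicates(["order_id"])
    assert result.count() == 2

    # Round-trip through Delta to exercise the storage format locally as well.
    path = str(tmp_path / "orders_delta")
    result.write.format("delta").mode("overwrite").save(path)
    assert spark.read.format("delta").load(path).count() == 2
```

Run it with pytest inside the container; the same tests then run in CI before anything gets anywhere near Production.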
A few years ago I wasn’t sure who was going to win; Golang seemed to be popular, and still is for that matter. When I first wrote a little Golang (~2+ years ago) I was just trying to see what the hype was all about. The funny thing is, at the time, and today, it seems like the Golang syntax is much simpler than Rust, easier to learn and pick up by far.
I never thought I would live to see the day, it’s crazy. I’m not sure whose idea it was to make it possible to write Apache Spark with Rust, Golang, or Python … but they are all geniuses.
As of Apache Spark 3.4, it is now possible to use Spark Connect … a thin API client to a Spark Cluster, built on top of the DataFrame API.
You can now connect backend systems and code, using Rust or Golang etc, to a Spark Server and run commands and get results remotely. Simply amazing. A new era of tools and products is going to be unleashed on us. We are no longer chained to the JVM. The walls have been broken down. The future is bright.
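Here’s what that looks like from the Python side, as a minimal sketch. It assumes pyspark 3.4+ installed with the connect extras and a Spark Connect server already running; the hostname below is hypothetical (15002 is the default Spark Connect port).

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect endpoint instead of spinning up a local JVM.
spark = (
    SparkSession.builder
    .remote("sc://my-spark-cluster:15002")  # hypothetical endpoint
    .getOrCreate()
)

# The DataFrame is built locally, but every operation executes remotely on the cluster.
df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
print(df.limit(5).toPandas())
```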
I always leave it to my dear readers and followers to give me pokes in the right direction. Nothing like the teeming masses to set you straight. Recently I was working on my Substack Newsletter, on the topic of Polars + Delta Lake, reading remote files from s3 … I left a question open on my LinkedIn account.
I had someone jog my leaky memory in favor of DuckDB. I haven’t touched DuckDB in some time, and I’m sure it’s under heavy development, what with that MotherDuck and all.
So, it’s time to talk about DuckDB + Delta Lake.
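As a taste of where this is going, here is a rough sketch of one way to query a Delta table with DuckDB from Python, going through the deltalake package’s Arrow dataset. The s3 path is hypothetical, and it assumes AWS credentials are already configured in the environment.

```python
import duckdb
from deltalake import DeltaTable

# Open the Delta table (hypothetical location) via the delta-rs Python bindings.
dt = DeltaTable("s3://my-bucket/tables/orders")

# Expose it to DuckDB as an Arrow dataset and query it with plain SQL.
con = duckdb.connect()
con.register("orders", dt.to_pyarrow_dataset())

print(con.execute("SELECT COUNT(*) FROM orders").fetchone())
```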
Sometimes it seems like the Data Engineering landscape is starting to shoot off into infinity. With the rise of Rust and new tools like DuckDB, Polars, and whatever else, things do seem to be shifting at a fundamental level. It seems like there is someone at the base of a teetering rock with a crowbar, picking and prying away, determined to spill tools like Java, Scala, Python, Spark, and Airflow, the things we’ve known and loved for years, from their lofty thrones.
Maybe they have all had their time in the Data Engineering sun; maybe it’s time to shake things up. It seems to be happening. It’s always hard to have those we hold dear be poked and prodded at. I’ve been using Spark since before it was cool, so when the word Ballista started to show up here and there, I took note.
Besides, I’ve been dabbling my grubby little fingers in Rust for some months now, and have seen The Light. Is it possible I could be living at the dawn of a new era? A new and exciting frontier of Data Engineering, finally, after all this time? Could Rust really take over? Will something like Ballista pull that old Spark from its distributed processing tower and claim its rightful place?
I’ve been a dog licking my wounds for some time now. Over on my Substack newsletter, I’ve been doing a small series on DSA (Data Structures and Algorithms). I tackled some of the easier stuff first, like Linked Lists, Binary Search, and the like. What’s more, I actually did most of it in Rust, since I’ve possibly, maybe slightly, ever so slightly, fallen in love with Rust.
Like most relationships, it vacillates between pure adoration and utter hatred, depending on the problem at hand. When I did a recent article on Graphs, Queues, and BFS, I attempted it in Rust and was struck a mighty blow; that borrow checker had me down. It seemed doable, but at the time, under time pressure to get the Newsletter out, I reverted to Python and moved on.
Alas, I’m back again, a glutton for punishment. This time I thought I should try another crack at parsing a graph with Rust, but in a real-life situation, no more made-up stuff. Actual data, actual graph, here we go. All code is on GitHub.