What is a Data Platform?
You know, for all the hordes of content, books, and videos produced in the “Data Space” over the last few years, famous or otherwise, I find there are volumes of information on the individual pieces and parts of working in Data: Data Quality, Data Modeling, Data Pipelines, Data Storage, Compute, and the list goes on. Far less is written about how those pieces fit together, and I found this to be a problem as I was growing in my “Data” career over the decades.
While it’s good to become an expert in all these areas (I guess we need to be), it’s easy to lose sight of the forest for the trees. You can get so caught up in learning how to write Spark that you forget Databricks is a whole ecosystem of features and products. You can also get so caught up in S3 or AWS Lambda that you forget about the infrastructure as a whole.
What I’m trying to convey is that the best Data Practitioners can see the Data Platform as a whole.
They can build Data Platforms from scratch, a skill that isn’t taught much. In everyone’s defense, it’s an extremely hard skill to learn. That’s why it’s usually done by Architects, Staff, or Principal Engineers. It takes a lifetime of experience for someone to understand a Data Platform from start to finish.
It’s one thing to be a “Senior” Engineer who can write Airflow DAGs with your eyes closed and Databricks Spark pipelines while you yawn in the morning. It’s another thing entirely to be told to go build a brand new Data Platform to replace AWS EMR and raw Parquet files stored in S3.
You have to be able to answer …
What exactly IS a Data Platform?
Let’s ask old ChatGPT this question and see what it says.
“A data platform is an integrated set of technologies and tools that enables the collection, storage, processing, management, and analysis of data. It serves as the foundation for organizations to derive insights, build data-driven applications, and support various analytics and machine learning initiatives.” – ChatGPT
The explosion of the Modern Data Stack has made this question almost impossible to answer. With the constant influx of new ideas and technologies, it’s hard to get your arms around a stream of things that never stops. But I think we should try to distill the fundamental structures of a Data Platform into a few bite-sized pieces, knowing that use cases vary.
- Architectural Foundations & Infrastructure
  - The big picture of how the puzzle pieces of an entire Data Stack work together as a whole.
- Data Ingestion & Integration Strategies
  - The source data and the glue for all the data coming into and leaving the Data Platform.
- Storage & Modeling for the Lake House
  - What sort of Storage system? Delta Lake, Iceberg, Hudi? How does it fit into the bigger picture?
- Data Governance, Quality & Cataloging
  - How do you control permissions, Production vs Development, compliance, etc.?
- Data Transformation & Processing Frameworks
  - What is going to be your main data processing tool(s)? Spark, Glue, DuckDB, Pandas, Airflow, etc.?
- Performance, Scalability & Cost Optimization
  - Do you understand your compute and storage costs? How about managed tooling costs? Does your architecture scale easily? What are the expected runtimes and SLAs?
- Monitoring, Observability & Ongoing Maintenance
  - How do you monitor the systems and notify people of success and failure? Slack, email, Airflow, etc.?
- Running Data Teams, Culture, and Tech
  - What kind of team and culture do you have? How do you handle PRs and Design Reviews? What are the expectations?
- Machine Learning & Advanced Analytics Integration
  - Do you have MLOps needs? What about model serving and APIs? How do you do feature engineering and experiment tracking?
I mean, jeez … isn’t that a boatload of stuff? How can one person know all these things? It is hard: if you simply don’t have at least some experience in each one of these spaces, it’s hard to know what you don’t know. But, in the end, it is what it is. All of those things in the list, and more, combine into a single Data Platform.
A Data Platform that is missing one or more of those points, or fails at one or more of them, is going to be a Data Platform that is not well received or used: one that does NOT provide Business Value and is instead a byword and a hindrance to the business and end clients.
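One rough way to make that concrete is to treat the list above as a checklist and ask which areas a given platform has actually addressed. This is only a sketch: the short area names, the `PlatformReview` class, and the example platform are all made up for illustration, not a standard of any kind.

```python
from dataclasses import dataclass, field

# The nine areas from the list above; the short names are my own labels.
PILLARS = (
    "architecture",
    "ingestion",
    "storage_modeling",
    "governance_quality",
    "transformation",
    "performance_cost",
    "monitoring",
    "team_culture",
    "ml_analytics",
)


@dataclass
class PlatformReview:
    """Hypothetical record of which areas a platform has addressed."""

    name: str
    covered: set = field(default_factory=set)

    def missing(self):
        # Keep the order of PILLARS so the report reads like the list.
        return [p for p in PILLARS if p not in self.covered]

    def is_healthy(self):
        # "Healthy" here just means no area has been skipped entirely.
        return not self.missing()


# An imaginary platform with nice tech but big gaps everywhere else.
review = PlatformReview(
    name="shiny-lakehouse",
    covered={"architecture", "ingestion", "storage_modeling", "transformation"},
)
print(review.missing())
```

A real review would obviously be a conversation and not a set of strings, but even a crude list like this makes the gaps hard to ignore.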
This is why GOOD Data Platforms are so rare. Sure, maybe they use the latest technology like Databricks or Snowflake and it’s a beautiful Lake House architecture, but it’s impossible to test anything, there is no Development environment, or the code always breaks because there are no tests, or maintenance and feature additions are a pain.
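To show what "testable" can look like at the smallest scale, here is a minimal sketch in plain Python. The `latest_per_key` function and the sample rows are hypothetical; the point is only that a transformation written as a plain function over plain data can be checked without a cluster or a Production table.

```python
def latest_per_key(rows):
    """Keep only the most recent row for each id (hypothetical transformation)."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # ISO-8601 date strings sort correctly as plain strings.
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])


def test_latest_per_key():
    rows = [
        {"id": 1, "updated_at": "2024-01-01", "status": "new"},
        {"id": 1, "updated_at": "2024-02-01", "status": "shipped"},
        {"id": 2, "updated_at": "2024-01-15", "status": "new"},
    ]
    result = latest_per_key(rows)
    assert [r["status"] for r in result] == ["shipped", "new"]


test_latest_per_key()
```

The same idea carries over to Spark or SQL pipelines: keep the logic in small units that accept sample data, and run them somewhere other than Production.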
Trust me, those Engineering pain points will bleed over into the rest of the business when people ask simple questions that never get answered, or things break, or data is late or missing. Then all the new tools in the world will not save you from the ire of the C-suite. It’s also a lot of balls to juggle at once when building or fixing up a broken Data Platform. There isn’t really any piece of the puzzle that you can shove into the background; each piece is critical to the overall function and health of the system.
Each one of these bullet points and topics is worthy of a book, and that is my plan: to explore each of these topics individually, dive into them, learn, grow, think, write. All of us have weak spots when it comes to certain parts of a Data Platform; it’s just inevitable. Personally, having built a few Data Platforms from scratch, I’ve been poked in the eye enough to have learned some lessons in each area.
As I look through the list, even at this moment, I can see where I took shortcuts, or which areas I was not familiar with and winged it, so to speak. That’s just life, I suppose.
I wish I’d had that perfect book or documentation about Building and Maintaining Data Platforms years ago. It would have saved me a lot of heartache and blood.