A Diatribe against Data Contracts and their Abuses.
Ok, so I don’t really mean all that. Or do I? I have no idea what the future holds. Sometimes it’s easy to pick out the winners, like Databricks and Snowflake. You can see, feel, and taste the results of those data products, a delicious and delectable bounty to feast upon. Other things are harder to read the tea leaves on. Kinda like Data Mesh … is it a thing, or is it not a thing? It’s hard to tell the charlatans and marketing/sales departments hawking the next Cure All Snake Oil apart from real life.
What about all this recent hubbub and buzz around Data Contracts, pushed by some popular Data Engineering faces like Ananth Packkildurai and Chad Sanderson? What is all the hype about? Are folks just pushing another tool down our throats, or is there a real problem that can be solved with Data Contracts?
What are Data Contracts and what Problem(s) do they solve (allegedly)?
I honestly only have a vague idea of what the experts would call a proper Data Contract … I just perused various LinkedIn posts and blog posts from these folks and came up with a list of thingy-things that could help us peons understand what they are, and maybe what they are good for.
- “Data contracts are decentralized”
- “Contract ownership and implementation must be handled in a decentralized manner upstream”
- “Data contracts are enforced at the producer level:”
- “must contain additional metadata beyond the schema, including descriptions, value constraints, and so on.”
- “create a brand new contract with the backend and agree on the best need for (data) consumers.”
- “Data contracts are just a trend to give back ownership to data producers”
- “The biggest challenge is organizational”
- “Data Contract ideas seem to be closely tied to streaming architectures”
Sure, I get it. This is the problem the Data Contract folks see and are trying to solve.
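To make those bullet points a little less hand-wavy, here is a rough sketch of what "schema plus extra metadata, enforced at the producer" might look like in plain Python. To be clear, this is my own guess and not anyone's official implementation: `FieldSpec`, `DataContract`, `publish`, the `orders` dataset, and the `checkout-team` owner are all made-up names for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class FieldSpec:
    """One field in the contract: a type plus the 'additional metadata'."""
    name: str
    dtype: type
    description: str
    required: bool = True
    # Optional value constraint, e.g. "amount must be non-negative".
    constraint: Optional[Callable[[Any], bool]] = None


@dataclass
class DataContract:
    """Hypothetical contract object, owned by the producing team."""
    name: str
    owner: str
    version: str
    fields: List[FieldSpec]

    def violations(self, record: dict) -> List[str]:
        """Return human-readable problems; an empty list means the record passes."""
        problems = []
        for spec in self.fields:
            value = record.get(spec.name)
            if value is None:
                if spec.required:
                    problems.append(f"{spec.name}: missing required field")
                continue
            if not isinstance(value, spec.dtype):
                problems.append(
                    f"{spec.name}: expected {spec.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if spec.constraint is not None and not spec.constraint(value):
                problems.append(f"{spec.name}: value constraint failed")
        return problems


# Assumed example dataset; the field names are invented for this sketch.
orders_contract = DataContract(
    name="orders",
    owner="checkout-team",
    version="1.0.0",
    fields=[
        FieldSpec("order_id", str, "Unique order identifier"),
        FieldSpec("amount_cents", int, "Order total in cents",
                  constraint=lambda v: v >= 0),
        FieldSpec("currency", str, "ISO 4217 currency code",
                  constraint=lambda v: v in {"USD", "EUR", "GBP"}),
        FieldSpec("coupon_code", str, "Coupon applied, if any", required=False),
    ],
)


def publish(record: dict, contract: DataContract, sink: list) -> None:
    """Producer-side enforcement: bad records never leave the source system."""
    problems = contract.violations(record)
    if problems:
        raise ValueError(f"{contract.name} v{contract.version}: {problems}")
    sink.append(record)
```

The only part that goes beyond plain schema control is the `description` and `constraint` on each field, plus the fact that `publish` lives with the producer. Everything hard about it (who owns the contract, who gets paged when it breaks) is not in the code at all.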
Honestly, after reading and reading I came to a few conclusions.
“There appears to be consensus that Data Contracts are about pushing data definitions back to producers.”
– me
And also
“There appears to be little agreement upon what the implementation of an actual Data Contract would look like, besides schema and data type definitions.”
– me
It honestly appears like those peddling Data Contracts, like the proverbial Johann Tetzel yelling in the streets, are the same ones who’ve been screaming about Data Mesh for a while. It does appear that Data Contracts are just an extension and implementation of Data Mesh concepts. Although I have to say, it seems like the whole Data Contract space needs some more implementation work before it goes mainstream. It’s the actual rubber-meets-the-road implementations that seem to be a little bit fuzzy.
Silly Things Data Contract Purveyors Assume
I think that there are a few things that the purveyors of Data Contracts are missing. These are just some general thoughts from a crusty, grumpy, old Data Engineer who’s seen many things come and go over the years … while in reality, very few things change besides technology.
- The concept seems to be built around assuming large to huge organizational structures. The best tools and ideas work across data teams of every size.
- Even in today’s modern data stack, most data pipelines you will find are batch, not streaming. Data Contract resources focus too much on streaming.
- Data Contracts need to differentiate themselves more from simple schema control.
- Pushing Data Contracts back to the source assumes those teams have good Data Engineering experience.
- Pushing data concerns back to different product teams has been tried for years, and rarely works.
- Data Contracts need a better case for why good data practices and data quality aren’t good enough.
- Data Contracts need better and clearer end-to-end tooling explanations and integrations, so folks know what they are signing up for.
I’ve worked at very large organizations where the concept of a Data Contract owned by some product engineering team would seem wonderful. I’ve also seen those same teams struggle with seemingly basic data tasks because there is simply no one on that team who works with data on a daily basis and has the years of experience that are actually required to manage and model data well.
I mean I’ve met and worked with some of the smartest software engineers, capable of incredible feats of engineering and design, yet those same persons struggled with seemingly simple data tasks (simple to those who’ve spent their career working with data). The very existence and rise of the Data Engineer should tell you something … that data products are complex and require specialized skills.
Also, I see Data Contracts more as an implementation inside a Data Quality tool; that seems to make more sense. I mean, most of the content and posts you read about Data Contracts mention schema management like 15 times, as well as its surrounding metadata, data types, etc. It would seem we already have a plethora of tools to manage such things.
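Here is what I mean, as a bare-bones sketch of the kind of checks a Data Quality tool already runs over a batch. This is plain Python, not the API of Soda Core or Great Expectations, and the rows, check names, and `run_checks` helper are invented for illustration.

```python
# Hypothetical batch of rows, as a nightly pipeline might load them.
rows = [
    {"order_id": "A-1", "amount_cents": 1250},
    {"order_id": "A-2", "amount_cents": -5},
    {"order_id": None, "amount_cents": 300},
]

# Each check is a readable name plus a predicate over a single row.
checks = {
    "order_id is not null": lambda r: r.get("order_id") is not None,
    "amount_cents is an int": lambda r: isinstance(r.get("amount_cents"), int),
    "amount_cents is non-negative": lambda r: (
        isinstance(r.get("amount_cents"), int) and r["amount_cents"] >= 0
    ),
}


def run_checks(batch, checks):
    """Count how many rows in the batch fail each check."""
    return {
        name: sum(1 for row in batch if not check(row))
        for name, check in checks.items()
    }


report = run_checks(rows, checks)
for name, failures in report.items():
    print(f"{name}: {failures} failing row(s)")
```

Null checks, type checks, and value constraints: that is most of what the Data Contract posts describe, and it already fits in a batch job owned by the data team.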
If the Data World as a whole can barely get Data Quality tools adopted as standard practice (think Soda Core or Great Expectations), then how in the world are we going to get product teams to all adopt Data Contracts? They are just as busy and overworked as anyone else.
The Good Coming From Data Contracts.
All that being said, I do have to admit that there are probably good things that would/will come out of the Data Contract craze. One such tool, called schemata, has been released as an open-source project. Sure, making other teams outside the central data team more responsible and aware of the stake and ownership they have in providing good data quality is a wonderful idea, but that is nothing new. Maybe Data Contracts want to make that easier for those teams without said knowledge to do so?
I suppose Data Contracts are good for those large organizations that are adopting a Data Mesh attitude, have truly made Data a number one priority, and are willing to force contracts down the throats of large product teams who have bad habits. That could only prove beneficial in the long run for solving a lot of data headaches.
What will become of Data Contracts? Will it be a standard thing in a few years, or will it die a slow and silent death? Time will tell.