The Death of the Data Warehouse, replaced by the Lake House. Or Has It?
This is an interesting one indeed, one that teases and puzzles the brain to no end. Has the Data Warehouse finally died? Has that unruly upstart, the Lake House, finally taken its place atop the seething mass of data we call home? Can we say that after all these decades the Data Warehouse Toolkit and Kimball have finally gone the way of the dinosaurs? Maybe. Probably. I don’t know.
The Death of the Data Warehouse.
Now it probably depends on whether you are an avocado toast-eating Millennial or a Gen Z ideologue in all your Reddit glory, but I think it’s fair to say that although remnants of the Data Warehouse do exist, and will continue to do so, much like the COBOL mainframes, for all intents and purposes … the Data Warehouse is indeed dead.
I mean, how would you like to be the CTO explaining to some group of executives the reason you’re still using SQL Server or Oracle to produce analytics out of your Data Warehouse? That ain’t going to get you hired again. What’s cool is cool, and the tech bros know you need to be able to say the word “Lake House,” “Databricks,” “Snowflake,” or “AI,” to stay relevant.
Times have changed. At some point, we simply must admit the world has moved on, and if you still use a classic Data Warehouse (I do not argue that you should or shouldn’t), you are at the very least viewed as irrelevant and behind the times, like it or not.
We should probably answer a question though.
What exactly IS a Data Warehouse?
I think someone might hem and haw about such things, but as someone who worked in and around, and built from scratch, my fair share of Data Warehouses, I think I have a simple definition to put forward.
Nothing more, nothing less. That’s how we did it back in the day, programming both ways uphill in the snow and rain. You got yourself a nice SQL Server, took a TSQL class, cracked open that Bible called the Data Warehouse Toolkit, and started arguing about facts and dimensions, third normal form, and all that jazz.
Heck, I used to go to User Groups back in the day where everyone would show off their shiny ERD diagrams and talk about their Data Marts and Analytics. The Postgres teams looking down their noses at the SQL Server teams, and both of those laughing at the poor Oracle acolytes in the corner.
Those days are dead. Of course, we will have the odd hangers-on, it’s hard to kick against the goad, especially at large companies, but Databricks and Snowflake done kicked the rock down the hill, no stopping it now.
If those COBOL mainframes are still running in basements around the country, then you know those SQL Server Warehouses be clinging on for dear life for another 100 years.
If the Data Warehouse died, did the Lake House rise from the ashes?
This is where the waters seem to get muddy, maybe not a lot, but a little. If the Data Warehouse is dead, you can bet that it was the Lake House that struck the death blow on the battlefield. It had some help, but it did win the hour.
The Data Lake did strike a few non-lethal blows; those Parquet files in S3, 7ish years ago, made that Data Warehouse bleed. That free-for-all Data Lake was a mess, yet a perfect storm of rising tools like Athena and BigQuery, combined with the popularity of cloud storage, turned the eyes green on all those Data Warehouse teams.
But it wasn’t quite enough. To be able to replace the Data Warehouse you need a few more things.
- ACID and CRUD
- Governance
- Schema
- Constraints
Those little buggers are at the heart and soul of a good Data Platform.
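To make those four a little less abstract, here is a minimal sketch of what schema, constraints, CRUD, and an atomic transaction look like in practice. Big caveat: it uses Python's built-in sqlite3 purely as a stand-in, not a real Lake House engine, and the `orders` table, its columns, and the `run_demo` function are all made up for illustration. Governance (who is allowed to see and touch what) isn't shown at all.

```python
import sqlite3


def run_demo():
    # Assumption: in-memory SQLite stands in for a real table format/engine.
    conn = sqlite3.connect(":memory:")

    # Schema + constraints: typed columns, NOT NULL, and a CHECK rule.
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )

    # CRUD: create, update, and delete rows inside one committed transaction.
    with conn:
        conn.execute("INSERT INTO orders VALUES (1, 'acme', 100.0)")
        conn.execute("INSERT INTO orders VALUES (2, 'globex', 50.0)")
        conn.execute("UPDATE orders SET amount = 75.0 WHERE order_id = 2")
        conn.execute("DELETE FROM orders WHERE order_id = 1")

    # ACID (atomicity): one bad row makes the whole batch roll back,
    # so the valid row 3 does not sneak in either.
    rejected = False
    try:
        with conn:
            conn.execute("INSERT INTO orders VALUES (3, 'initech', 10.0)")
            conn.execute("INSERT INTO orders VALUES (4, 'umbrella', -5.0)")
    except sqlite3.IntegrityError:
        rejected = True

    # Read what survived.
    rows = conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()
    conn.close()
    return rejected, rows


if __name__ == "__main__":
    print(run_demo())
```

The Data Lake of loose Parquet files gave you none of this out of the box: a half-written batch or a negative amount would just land in the bucket and become somebody's problem later.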
If you asked me what exactly IS a Lake House, that is a good question.
For me at least, a Lake House is a combination of abstracted cloud storage paired with a compute platform. It includes ACID, CRUD, Schema, Constraints, Governance and the like.
This is why it killed the Data Warehouse. It gives you what the Data Warehouse gave you, except at scale. No one cares anymore if you actually need all that power of Spark or whatever, that’s not the point. It’s what the cool kids are doing, it’s the new baseline of acceptance for the Modern Data Stack.
Sometimes the world moves on and we have to move with it. Like any new thing, the Lake House has come with its fair share of problems or perceived problems.
- Rising and sometimes uncontrollable costs
- Lack of standardization (Databricks is trying to solve this with Unity Catalog etc.)
- More of a free-for-all than in the past
- Governance is a little shaky (this has changed in the last few years)
Still makes me wonder.
Is the Data Warehouse dead, or is the Lake House just the new 2.0 Data Warehouse? Maybe the Data Warehouse isn’t dead, maybe it’s just resurrected to us in a new form, in the form of Databricks and Snowflake.
If we have what we had before, only better, maybe it didn’t die, it just reincarnated.
What think you?