Data Modeling in the Brave New Lakehouse World
It is a Brave New World out there these days. The new tools and features come out faster than your mom on Sunday morning getting you ready for church. The same goes for the content and advice being produced on a myriad of platforms, the ole’ Like and Subscribe, and all that bit. It does make you wonder after a while, what you can trust, who has your best interest in mind, and who is selling you a bottle of snake oil, doesn’t it?
Today we talk about Data Modeling. Specifically, Data Modeling in the new world we all live in, christened The Lakehouse by our benevolent Vendor Overlords.
Data Modeling is an art, not a science.
Let’s be clear here … we are talking about the Lakehouse … the cloud storage-backed, open-source storage format buggery that has become near and dear to us all. Delta Lake, Iceberg, whatever. We are not talking about Postgres or any other RDBMS.
Relational Databases are not as forgiving as our Lakehouse. In the SQL Server, MySQL, or Postgres world you need to normalize the data, create minute indexes for every case, throw salt over your left shoulder, and spit on the ground to get those queries singing a fine tune.
Not so much with Spark + Lakehouse. That stuff don’t matter anymore.
Honestly, things have almost atrophied in Data Modeling with the advent of the Lakehouse. We’ve gone backward. Before, we had to be careful, spending days coming up with the perfect DDL and indexes to support what had to be done. Today we can throw Spark on top of a pile o’ crap and it does wonders.
The Great Amalgamation of Data Modeling.
We are no longer in thrall to some terrible DBA who comes stalking out of the basement looking for our blood because we didn’t do something “Kimball” style. Those names like Inmon and Kimball, once lauded as Demi-Gods sent from above to us mere mortals, have all but lost their potent power with the next generation of Data folk. It is truly a Brave New World.
How I Data Model in the Brave New Lakehouse World.
Well, after all that huffing and puffing I should probably tell you how I go about Data Modeling in the Lakehouse. It is pretty much a combination of all of what has come before, a true hot pot of delicious meats I’ve picked and chosen from wherever I please.
I will just give you a bullet list of what I do.
- Throw the Bronze, Silver, and Gold modeling material out the window for the dogs.
- Spend a fair amount of time trying to understand the Partitions or the Clustering Keys for the dataset.
  - this could be time series-based … (year, month, day, time, etc.)
- Much like the old days, spend some time understanding the queries or use cases for the data set.
  - What will be filtered on, searched for, etc.
- Stick with the classic Kimball-style Data Warehouse design of …
  - “staging” or raw tables
  - decide if a data set is a FACT or DIMENSION
    - Is it quickly accumulating, non-changing (FACT) data, or slower-growing, slowly morphing, descriptive (DIMENSION) data?
- Keep tables wide … no OVER-normalization of datasets
  - the fewer joins Spark has to do the better
- Study the Data Types carefully
  - try to make sure JOIN keys for datasets are not strings, etc.
- Understand the Uniqueness (what used to be known to the old Mountain Men as Primary Key) for each dataset.
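A couple of the items in that list are easy to show in a few lines of plain Python, no Spark cluster required. Treat this as a rough sketch and nothing more: the sample rows, the column names (`order_id`, `customer_id`, `event_ts`), and both helper functions are made up for illustration. On a real Lakehouse table you would run the same kind of checks with Spark against the actual data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the column names are assumptions for illustration.
rows = [
    {"order_id": 1, "customer_id": 10, "event_ts": "2024-01-15T08:30:00"},
    {"order_id": 2, "customer_id": 11, "event_ts": "2024-01-15T09:45:00"},
    {"order_id": 3, "customer_id": 10, "event_ts": "2024-02-01T12:00:00"},
    {"order_id": 3, "customer_id": 10, "event_ts": "2024-02-01T12:00:00"},  # duplicate
]


def add_partition_columns(row, ts_column="event_ts"):
    """Derive year/month/day partition values from an ISO-8601 timestamp string."""
    ts = datetime.fromisoformat(row[ts_column])
    return {**row, "year": ts.year, "month": ts.month, "day": ts.day}


def duplicate_keys(rows, key_columns):
    """Return the candidate-key values that appear more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Time-series partitioning: how many rows would land in each (year, month, day)?
partitioned = [add_partition_columns(r) for r in rows]
partition_sizes = Counter((r["year"], r["month"], r["day"]) for r in partitioned)

# Uniqueness: is order_id really the "Primary Key" we think it is?
dupes = duplicate_keys(rows, ["order_id"])

print(partition_sizes)
print(dupes)
```

If `dupes` comes back non-empty, the key you assumed was unique is not, and that is worth knowing before anybody builds a Fact table on top of it. The partition counts give a quick feel for whether year/month/day splits the data into sensible chunks or into a pile of tiny ones.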
Doing these simple and time/battle tested things will lead to success at pretty much all levels. You will understand the datasets and their use cases well enough by doing the above. The data model will be simple enough that you won’t have to do 15 joins and 3 backflips inside Spark to get what you want.
And, most important of all … it’s SIMPLE.
I don’t have to argue with anyone about normalization methods, I can ignore useless Silver layers that rack up storage and compute costs, I have OBVIOUS (even to non-engineers) Fact and Dimension tables that are self-explanatory to even casual users.
It requires no ERD diagrams designed by overpaid and useless “architects.”
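To make the "keep tables wide" idea a little more concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the customer, city, and country lookups, the column names, and the `widen_customer` helper are invented for illustration and are not taken from any real system. The point is only that folding a snowflaked chain of lookups into one wide Dimension, once, means every query afterward needs a single join on an integer key.

```python
# Hypothetical snowflaked lookups; names and values are made up for illustration.
countries = {1: {"country_name": "USA"}}
cities = {100: {"city_name": "Des Moines", "country_id": 1}}
customers = [
    {"customer_id": 10, "customer_name": "Acme", "city_id": 100},
]


def widen_customer(customer):
    """Fold the city and country lookups into one wide customer row."""
    city = cities[customer["city_id"]]
    country = countries[city["country_id"]]
    return {
        "customer_id": customer["customer_id"],
        "customer_name": customer["customer_name"],
        "city_name": city["city_name"],
        "country_name": country["country_name"],
    }


# Build the wide Dimension once, keyed on the integer customer_id.
dim_customer = {c["customer_id"]: widen_customer(c) for c in customers}

# A tiny Fact table: fast-accumulating rows that do not change.
fact_orders = [{"order_id": 1, "customer_id": 10, "amount": 25.0}]

# One lookup per fact row instead of a chain of three joins.
report = [{**f, **dim_customer[f["customer_id"]]} for f in fact_orders]
print(report)
```

The trade-off is the usual one: the wide Dimension repeats the city and country names on every customer row, which costs a bit of storage and buys fewer joins at query time.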
When you find yourself in some strange Brave New World …
So next time you get confused by these modern Lakehouses that have come to save the world and bring us to paradise, and you don’t know who to listen to … I have a suggestion.
Do what you understand and what is simple and has been used for ages. Pick and choose. Take good ideas from all sources and throw them in the pot … give it a good shake, and go on your way.