What is a Healthy Lake House?
Maybe I’m the only one who thinks about it, not sure. The Lake House has become the new Data Warehouse, yet when I ask the question “What makes a healthy Lake House?” no one is sure what the answer is, or you get different answers.
It seems like a pretty important question considering that Lake Houses have taken the data landscape by storm and now store the vast majority of our data. With all the vendors pumping out Lake House formats and platforms (think Delta Lake and Apache Iceberg), the main focus seems to be adding features and addressing internal data quality, aka the quality of the data stored in the Lake House itself.
But this isn’t what I’m talking about when I say a “Healthy Lake House.” What I mean by that is …
- we have a piece of technology that is holding our data (Delta Lake, Iceberg, Hudi, etc.)
- it’s central and extremely important to our Data Platform.
- how do we know it’s working correctly or well?
For all of us who’ve worked in tech, making software and writing code, we know that all technology tends to atrophy over time, at least in active systems, and large data systems like a Lake House tend to have that same problem.
What do I mean by atrophy? Not sure, maybe degrade is a better term?
What we all know about Lake Houses (formats).
If there are one or two things that should come to mind when you think about a healthy Lake House, it’s these, mostly because the platforms in question provide what I would call maintenance features. Meaning they are provided for us, and the assumption is that if we don’t use them, the Lake House system will degrade (there’s a quick sketch of running both right after this list).
- OPTIMIZE or COMPACTION
- VACUUM or PRUNE
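To make that concrete, here’s a minimal sketch of running both maintenance operations with the deltalake (delta-rs) Python bindings. The table path and retention window are made up, and the exact method signatures can vary a bit between versions, so treat this as a rough illustration rather than the one true way.

```python
from deltalake import DeltaTable

# Hypothetical table path; swap in your own location (s3://, abfss://, local, etc.)
dt = DeltaTable("s3://my-bucket/warehouse/orders")

# OPTIMIZE / COMPACTION: rewrite many small files into fewer, bigger ones.
# target_size is in bytes; ~256MB here, in line with the conventional wisdom below.
metrics = dt.optimize.compact(target_size=256 * 1024 * 1024)
print(metrics)  # files added/removed, sizes, etc.

# VACUUM / PRUNE: physically delete files no longer referenced by the table,
# as long as they are older than the retention window (7 days here).
deleted = dt.vacuum(retention_hours=168, dry_run=False)
print(f"removed {len(deleted)} dead files")
```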
These Lake Houses are made up of many millions of parquet files with data in them. If those file sizes are not optimal, we will have problems.
There could be too many small files, or too many big files; either way, the IO of any system trying to access that data in the Lake House will suffer from non-optimized file sizes. Most Lake Houses provide some sort of maintenance we can run to reshuffle/rewrite files into the optimal size.
- conventional wisdom seems to be …
- tables under 3TB: target roughly 256MB files
- tables over 3TB: target roughly 1GB files
Who knows if those numbers are exactly right, but somewhere in there, with testing, is probably the right answer. Of course you can see the problem: too small and excessive IO occurs, the classic HDFS and Hadoop “small file problem.” If the files are oversized, queries suffer too, since there is less parallelism available and each task ends up reading far more data than it actually needs.
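If you want to see where a table actually sits relative to those targets, the Delta transaction log already knows every active file and its size. Here’s a rough sketch, again assuming the deltalake Python bindings and a hypothetical path; the 32MB and 1GB thresholds are just illustrative cutoffs.

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/warehouse/orders")  # hypothetical path

# Every active file in the current snapshot, with its size in bytes.
files = dt.get_add_actions(flatten=True).to_pandas()
sizes_mb = files["size_bytes"] / (1024 * 1024)

small = (sizes_mb < 32).sum()        # tiny files, the classic small file problem
oversized = (sizes_mb > 1024).sum()  # files north of ~1GB

print(f"{len(files)} active files, median size {sizes_mb.median():.1f} MB")
print(f"{small} files under 32MB, {oversized} files over 1GB")
```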
The other problem is the immutability of parquet files, which forces any sort of CRUD operation to write new files, making others obsolete. In large Lake House operations, this can mean a lot of “dead” files that cause overhead and take up unneeded space.
I’ve often asked myself, as someone who runs a very large Lake House platform in real life … is there a tool or thing I can use that will tell me if my Lake House is healthy? Am I missing other things I should be concerned about?
Other things that come to mind are data skew, in particular in the partitioning or clustering of data. Is the partitioning or clustering of each table optimal, or does it have skew (some partitions with very small amounts of data, others with very large amounts)?
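Here’s the sort of check I have in mind, sketched with the same deltalake bindings: group the active files by partition value and compare total bytes per partition. The table path and the partition column (event_date) are made up; with flatten=True the partition values show up as partition.<column> columns in the add actions, at least in the versions I’ve used.

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/warehouse/orders")  # hypothetical path

files = dt.get_add_actions(flatten=True).to_pandas()

# Hypothetical partition column; flattened add actions expose it as "partition.event_date".
per_partition = files.groupby("partition.event_date")["size_bytes"].sum().sort_values()

print(f"{len(per_partition)} partitions")
print(f"largest partition is {per_partition.max() / per_partition.min():.0f}x the smallest")
print(per_partition.tail(5))  # the heaviest partitions
```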
To that end I started to work on https://github.com/danielbeach/BoggleyWollah/tree/main
Think of it as writing code out loud to answer these questions. They appear to be simple questions on the surface, but they turn out to be more complicated once you start peeking under the hood of things like Delta Lake and learning about checkpoint.parquet files and the wonderful transaction log made up of too many JSON files.
How else can you know how many dead parquet files you have without being able to properly process the “state” of the Lake House and identify problems?
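For a local table, the brute-force version of that check looks something like this: ask the transaction log which parquet files the current snapshot actually references, list what is physically sitting under the table, and diff the two. A sketch only, assuming the deltalake bindings and a made-up local path; on object storage you would list keys with something like s3fs instead of os.walk.

```python
import os
from deltalake import DeltaTable

table_path = "/data/warehouse/orders"  # hypothetical local table

dt = DeltaTable(table_path)

# Files the current snapshot actually references (paths relative to the table root).
active = set(dt.files())

# Every parquet file physically present under the table, ignoring the transaction log itself.
on_disk = set()
for root, _, names in os.walk(table_path):
    if "_delta_log" in root:
        continue
    for name in names:
        if name.endswith(".parquet"):
            rel = os.path.relpath(os.path.join(root, name), table_path)
            on_disk.add(rel.replace(os.sep, "/"))

dead = on_disk - active
print(f"{len(active)} active files, {len(dead)} dead files waiting to be vacuumed")
```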
It will most likely be a work in progress over a long period of time; hopefully it will culminate in some pip or uv installable Python package that can give us a rough idea of how our Lake House is doing. I’m starting with Delta Lake and will work on Iceberg and Hudi after.