Data Engineering a Machine Learning Pipeline In Real Life.
It doesn’t take long reading articles on Medium or Towards Data Science to become enamored with Machine Learning, especially with the people and companies who do it in “production.” I always read about the big picture, the fancy algorithms, and the cloud computing, but I’m left with the feeling that something is missing. It’s all the details that are missing. Where is the force behind it all, bringing everything together? I like to think it’s called Data Engineering, with some DevOps for good measure.
I’m always amazed at how complex some people can make things. It seems to me software engineering is especially guilty of this. Over-engineering a system is, in my opinion, one of the worst mistakes that can be made. It has long-lasting tentacles that reach out from the dark when least expected, dragging some poor soul down to Davy Jones’s Locker. Breaking tasks down into their simplest form is one of the first steps I always take. I think this applies well to the systems and pipelines behind machine learning. This is pretty much what I do all day at work for the man.
The Not So Magic Behind The Scenes.
There is no magic behind the scenes when it comes to Machine Learning pipelines. I’ve found it’s really just the same ol’ thing, at least for anyone who’s been around moderately complex ETL.
- Pull/Gather Data.
- Transform the Data.
- Load/Stage Data into/for Model.
Sure, there are a few different steps and challenges that are specific to ML pipelines.
- Training on Data.
- Dealing with Big Data.
- Mix of Disciplines and Tech.
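The pull, transform, load skeleton listed above can be sketched in a few lines. This is only a minimal illustration, not a real pipeline: the function names, the `id` and `value` columns, and the inline CSV string are all made up for the example, and a real "load" step would stage data wherever the model expects it.

```python
import csv
import io


def pull(source):
    """Pull/gather: read raw rows from a CSV text source (hypothetical input)."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows):
    """Transform: cast types and drop rows with a missing value."""
    cleaned = []
    for row in rows:
        if row["value"] == "":
            continue  # skip incomplete records
        cleaned.append({"id": int(row["id"]), "value": float(row["value"])})
    return cleaned


def load(rows):
    """Load/stage: shape rows into feature lists a model could consume."""
    features = [[r["value"]] for r in rows]
    ids = [r["id"] for r in rows]
    return features, ids


# Hypothetical sample data; the second row has a missing value.
raw = "id,value\n1,0.5\n2,\n3,2.0\n"
features, ids = load(transform(pull(raw)))
```

Everything that makes an ML pipeline harder than this is layered on top of the same three steps.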
It’s Always About the Data
Why do I say there is no special sauce behind building ML pipelines? Because most of the time it’s just like writing advanced ETL code. What do I mean by “advanced” ETL code? What I mean is that usually it involves more than just picking up CSV files that have been dumped somewhere and shoving them off into a database. What makes ML pipelines harder or different?
Training on Data.
If you have been around ML, one thing you will learn is that data scientists and their hand-curated models love... data... and lots of it. It’s usually what can make or break an ML system. You need and want as much data as possible to train and develop a model... every time, especially while the model is still changing in the beginning.
When R&D is being done on an ML system, this means that massive amounts of data are being run through a pipeline, or pieces of a pipeline, over and over again. Bugs are being found and worked out, especially in the beginning, as tweaks and changes are happening day by day. Even when models become more stable, new data is typically still being fed into the system, or new variables are being added and the model is retrained.
High Amounts of Training on ML Pipelines
This volume of data being ingested usually has some side effects.
- Speed Matters
Usually there are people waiting for the next ingest run, and if it’s a big data system, something on GKE for example, the longer things take the more money you’re spending.
- What’s Been Done
Another part of running high-volume datasets through ML pipelines is tracking what’s been done and when. Most likely the data pipeline has multiple steps and transforms. Nothing wastes time and money faster than reprocessing data more than once. Use a Postgres database or something else to log what’s been done and when.
- Maximize Resources
Related to number uno, make sure to use every core and thread if possible. Most likely your data pipeline can be broken down into steps and units of work that can be parallelized. Do it.
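The last two points, logging what’s been done and running units of work in parallel, fit together naturally. Here is a rough sketch under some loud assumptions: it uses Python’s built-in `sqlite3` as a stand-in for Postgres so the example runs anywhere, the `processed` table and the function names are hypothetical, and it uses a thread pool, which suits IO-bound work. For CPU-bound transforms, `ProcessPoolExecutor` from the same module offers the same `map` interface.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Create the (hypothetical) processing log table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "unit TEXT, step TEXT, finished_at TEXT, PRIMARY KEY (unit, step))"
    )
    return conn


def pending_units(conn, step, units):
    """Return only the units this step has not already finished."""
    done = {
        row[0]
        for row in conn.execute(
            "SELECT unit FROM processed WHERE step = ?", (step,)
        )
    }
    return [u for u in units if u not in done]


def run_step(conn, step, units, work, max_workers=4):
    """Run `work` on every pending unit in parallel, then log each one."""
    todo = pending_units(conn, step, units)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # An exception in any unit propagates here, so nothing from a
        # failed run gets written to the log below.
        results = list(pool.map(work, todo))
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO processed (unit, step, finished_at) VALUES (?, ?, ?)",
        [(u, step, now) for u in todo],
    )
    conn.commit()
    return dict(zip(todo, results))


# Usage: the second run only touches the one new file.
conn = open_log()
files = ["a.csv", "b.csv", "c.csv"]
first = run_step(conn, "transform", files, str.upper)
second = run_step(conn, "transform", files + ["d.csv"], str.upper)
```

Writing to the log only from the main thread keeps the database handling simple; the workers just do the work and hand back results.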
Big Data
Another aspect of machine learning data pipelines that makes them the same, yet different, is the big data part. While this is in no way specific to ML, it’s more likely than not for an ML data pipeline to fit into the big data world. Things just change when you start dealing with multiple terabytes of data.
Everything Starts to Matter
- ML Data Pipeline Design
Designing an ML pipeline needs some more architectural thought. When thinking about the scale and cost of such a system, all options are usually explored, from cloud providers to the different services those cloud providers offer. Cool can sometimes mean expensive, so it’s best to figure that out upfront.
- Storage
Storage can start to take on new hues of meaning. Instead of just randomly dumping things in Glacier or Nearline, you might all of a sudden realize that waiting all those extra seconds for every file across multiple terabytes means you will be waiting years, or paying for a running pipeline that is just sitting there half the time. Storing metadata about the data also becomes invaluable, because knowing what you need and where to get it becomes a major task with big data.
- IO
It’s hard to overstate how critical Input/Output becomes in a big data machine learning pipeline. Writing data to and reading it from files, waiting for APIs, all of these become major bottlenecks. To give you a hint, I’ve written about specific IO topics here and here. Suffice it to say you will become an expert in the details.
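To put a number on the "extra seconds for every file" problem, here is some back-of-the-envelope arithmetic. The figures are purely hypothetical (10 million files, 3 seconds of retrieval latency each), and the `workers` parameter assumes perfect scaling across parallel readers, which real storage systems rarely deliver.

```python
def total_wait_days(num_files, seconds_per_file, workers=1):
    """Days spent waiting on retrieval latency alone, ignoring transfer time."""
    total_seconds = num_files * seconds_per_file / workers
    return total_seconds / 86_400  # seconds in a day


# Hypothetical workload: 10 million files at 3 seconds of latency each.
sequential = total_wait_days(10_000_000, 3)
parallel = total_wait_days(10_000_000, 3, workers=64)
print(f"one reader: {sequential:.0f} days, 64 readers: {parallel:.1f} days")
```

Under these assumptions a single reader spends roughly 347 days just waiting, while 64 readers bring that down to about 5.4 days. A few seconds per file is invisible on a small job and dominates everything on a big one.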
Mix of Tech/Skills
Last but not least, you will find that ML data pipelines contain a large mix of disciplines/tech that your average data warehouse ETL does not. This comes from having the above combination of problems to solve. The large scale of the data, the speed and efficiency that are needed, and the creativity in the tech stack usually lead to interesting pipelines.
- Big Data Ecosystems.
Everything from the storage, compute, and distribution to the code tools and packages requires at least medium knowledge in the discipline. Not to mention stitching everything together to create a pipeline that can be kicked off with a single command. Knowing your code is important, but in ML pipelines knowing the mix of tools at your disposal, and what to use when, is a big deal.
- Mix of Tech
I mentioned this above, but I’ve found that ML pipelines require you to broaden your horizons. In a single pipeline I’ve worked with Postgres, Docker, Kubernetes, GKE, Cloud Storage, and Stackdriver. Most of that comes before you even get to writing the actual pipeline code.
In Conclusion
What it comes down to is that machine learning pipelines are not magic. There isn’t much that makes them more difficult than your average ETL. The pieces that make ML pipelines more difficult are the things that make working on big data and complex systems difficult. It’s learning the large variety of tech stacks. It’s realizing the devil is in the details and that slow and inefficient code won’t hide itself when you’re processing terabytes of data over and over again.