Challenges of Machine Learning Pipelines at Scale… When You Don’t Work at Google.
ml pipelines
Building Machine Learning (ML) pipelines with big data is hard enough, and it doesn’t take much of a curve ball to make it a nightmare. Most of what you will read online are tutorials on how to take a few CSV files and run them through some sklearn package. If you are lucky, you might find some “big data” ML stories on Medium where someone uses Spark to crunch a bunch of JSON, Parquet, or CSV files at a scale of ten to a few hundred gigabytes. Usually they are simplistic and vague. Unfortunately, that isn’t how it works in the real world.
the real world
What happens when you have to work on more than a hundred terabytes of data?
What happens when the data needed for the ML model can’t be found in nicely formatted databases, CSV files, or other structured data sources?
What if you work in an environment where Python is your only choice because that’s what the vast majority of Data Scientists can use and develop methodologies in?
What happens when you don’t work at Google and your options don’t include Spark or Scala… you get PyCharm, some incredibly complex code, and your imagination?
how to solve the complex ml pipeline problem
These problems are solvable, not impossible. It’s tempting to get caught up in the discussions and hype of Cloud Architecture, specialized distributed processing, streams, the API craze, service madness, and serverless mania. Many times, fitting the round problems of complex, large-scale ML data pipelines into these square holes just doesn’t work. That’s when you have to go back to the basics. These four tools will let you process well over 100 TB of data through a specialized ML pipeline.
the tools
- Apache Airflow
- Kubernetes
- Docker
- Python
First, you need to break down the problem of assembling your data into analysis-ready data, data ready to be fed into the Machine Learning model itself. I let the smart Data Scientists decide what that data looks like; I just worry about how to get it there. In my case, it often includes complex geospatial transformations that make your head spin. But that’s ok: stepping back and not treating it purely as a big data problem can help. If you have to solve a problem that doesn’t fit in the Spark box, that’s ok, solve it as you would if you were running it on your laptop. In the end it’s going to be running on someone else’s laptop in the Cloud anyways. You can deal with distribution later.
the steps
First, if the pipeline is complex and requires many steps and dependencies, choose something like Airflow to orchestrate it; you won’t be sorry. Heck, you even get a nice GUI for free, and you can’t beat that.
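As a hedged sketch of what that buys you, here is a minimal Airflow DAG (assuming a recent Airflow 2.x; the DAG ID, task names, and schedule are placeholders, not a real pipeline):

```python
# Minimal Airflow DAG sketch: three placeholder tasks wired into a dependency
# chain. Assumes a recent Airflow 2.x install; the names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="ml_feature_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule=None,   # trigger manually, or swap in a cron expression
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract_raw_data")
    transform = EmptyOperator(task_id="geospatial_transform")
    assemble = EmptyOperator(task_id="assemble_analysis_ready_data")

    # Dependencies are declared with >>; the Airflow GUI draws the graph for free.
    extract >> transform >> assemble
```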
Second, just write your Python code to solve the problem inside a Docker container. There are pre-built Docker containers for everything these days; you can find one with the tools you need already installed… guaranteed.
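For instance, the code inside the container might be nothing more than a single-purpose script like the sketch below; the file name, paths, and transform are hypothetical stand-ins for whatever your pipeline actually does. Bake it into an image built from a base container that already has your dependencies, and you’re done.

```python
# process_chunk.py: a hypothetical single-purpose worker meant to run inside a
# Docker container. It does exactly one unit of work: read one chunk,
# transform it, write one result. The transform itself is a placeholder.
import argparse
import json
from pathlib import Path


def transform(record: dict) -> dict:
    # Stand-in for the real work, e.g. a gnarly geospatial transformation.
    record["processed"] = True
    return record


def main() -> None:
    parser = argparse.ArgumentParser(description="Process a single data chunk.")
    parser.add_argument("--input", required=True, help="Path to one input chunk (JSON lines)")
    parser.add_argument("--output", required=True, help="Where to write the processed chunk")
    args = parser.parse_args()

    out_path = Path(args.output)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with open(args.input) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(json.dumps(transform(json.loads(line))) + "\n")


if __name__ == "__main__":
    main()
```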
Third, step back from the problem and decide what the lowest level, or smallest unit of work, is that you can reasonably break the code into. When you’re dealing with complex big data, even at the multi-terabyte level and in the context of preparing it for ML analysis, you will still find that there is some set of ETL processing that is done over the mass of data. You just need to find out what those steps are by solving the problem at the lowest level possible. What is the smallest piece of data you can apply those steps to?
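To make that concrete, one hedged way to enumerate the units of work (the directory layout and file pattern here are assumptions) is to list the raw chunks and pair each one with an output path, so each worker gets exactly one:

```python
# Hypothetical chunk enumeration: one work item per raw file. Swap the local
# glob for an S3/GCS listing if your data lives in object storage.
from pathlib import Path


def enumerate_chunks(raw_dir: str, out_dir: str) -> list[dict]:
    """Return one {"input", "output"} pair per raw chunk."""
    return [
        {"input": str(path), "output": str(Path(out_dir) / path.name)}
        for path in sorted(Path(raw_dir).glob("*.jsonl"))
    ]
```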
Fourth, put that sucker out on Kubernetes. Kubernetes has become ubiquitous, is everywhere, and is offered as a service by AWS, Google, and Microsoft, meaning it takes little effort to learn after a day or two of reading and messing around. What you basically should have is an orchestration tool in Airflow telling your complex pipeline what order to run in and what to wait on, while the pretty little units of work in Docker run on hundreds to thousands of Kubernetes pods.
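Here is a hedged sketch of wiring one of those Dockerized units into Airflow and Kubernetes, assuming the apache-airflow-providers-cncf-kubernetes package is installed (the import path varies by provider version, and the image name, namespace, and paths below are placeholders):

```python
# Run one Dockerized unit of work as a Kubernetes pod from Airflow.
from datetime import datetime

from airflow import DAG
# Older provider versions import from ...operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="ml_pipeline_on_k8s",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    process_chunk = KubernetesPodOperator(
        task_id="process_chunk_0001",
        name="process-chunk-0001",
        namespace="ml-pipeline",                               # placeholder
        image="registry.example.com/ml/process-chunk:latest",  # placeholder
        cmds=["python", "process_chunk.py"],
        arguments=[
            "--input", "/data/raw/chunk_0001.jsonl",
            "--output", "/data/processed/chunk_0001.jsonl",
        ],
        get_logs=True,  # stream pod logs back into the Airflow task log
    )
```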
Sure you might have to get creative in logging, tracking, and error handling, but let’s face it, those problems are insignificant in comparison to the problem you just solved.
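In practice a lot of that “creativity” turns out to be boring: let Airflow retry a failed chunk and rely on plain Python logging inside the worker. A hedged sketch, with arbitrary numbers:

```python
# Basic error handling sketch: Airflow retries plus ordinary logging.
import logging
from datetime import timedelta

# Pass this as DAG(default_args=...) so every task inherits retry behaviour.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# Inside the worker container, plain logging is enough; with get_logs=True the
# KubernetesPodOperator streams stdout/stderr into the Airflow task log.
logging.basicConfig(level=logging.INFO)
logging.getLogger("process_chunk").info("finished chunk %s", "chunk_0001")
```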
the secret
Everyone wants you to think there is a secret to processing big data ML pipelines at scale. I’m here to tell you there isn’t. Data is dirty, and many times the problem calls for complex, outside-the-box software transformations. There is a very good chance that if it’s complex and unique, something like Spark just isn’t built for it, and the hurdles to make it fit sometimes aren’t worth it.
The secret is just stepping back and being creative. It’s ok to add a little color and complexity to an ML pipeline if it warrants it. Sometimes Software Engineers who work every day on small units of data flying around APIs won’t get it. Sometimes even Data Engineers who have spent their lives in SQL and CSV files won’t get it either. Even some Data Scientists who have been able to solve all their problems with Pandas won’t get it. If it were easy, anyone could do it, and it wouldn’t be hard.
Step back, define your problem, even if it’s custom or unique, and write your code to fit in Docker. Figure out what the smallest unit of work is, even if it requires some intermediate “antics.” Rewrite your almost-there code to fit, push it to Kubernetes, and watch the data fly!