Data Engineering vs Data Science – Where’s The Love??
It seems like a never-ending battle for supremacy. Articles about Data Science being the bee’s knees, then more articles about how Data Engineering holds up the world of Data Science like Atlas. Whenever I read something in one of these two categories on Medium or wherever, it just seems like an ego clash to me. It’s human nature to want to be the best, to be better, to feel like you are the person who really makes it all happen.
My Two Cents – Data Engineering vs Data Science.
First, why does there have to be a discussion about who does what, who you need first, who provides more value? I think the marketplace has shown there is a need for both Scientists and Engineers. It seems pretty obvious to me. The fact that there are two different job titles in the area of big data (yes, with the emerging unicorn of Machine Learning Engineer) proves there is room in this world for both types. If you don’t believe me, go do a job search. People should leave well enough alone and just be happy with both Data Scientists and Data Engineers, each in their own right.
Human Nature
Is there a subset of people who claim to do both? Of course. How many of those people who claim to do both are actually doing both well? Extremely few. There are always a few people, like Andreas Kretz, who are truly gifted. I’ve met many talented Data Scientists and Data Engineers. They are talented because they focus on what they are good at and passionate about.
It’s human nature to focus on what we care about. If you tell a Data Scientist to build some sort of recommender system, what do you want them doing? You want them to focus on the model, fine-tuning it, feature engineering. Do you really want them working on Kubernetes, worrying about data pipelines, logging, tests? I think it’s good to be well rounded, to know a little about a lot, but you also need to specialize and focus. How else will you become a master of your craft?
The conversation should not be about who is better, the Data Scientist or the Data Engineer. I think it’s pretty obvious what each person is going to be good at. The conversation should be about how to build a team of both Scientists and Engineers who can collaborate and learn from each other to produce the best possible product. I would spend less time thinking about how to find that perfect person who can design and build the perfect Machine Learning system and more time finding teams that can do it.
Why You Shouldn’t Let Data Scientists Write Production Pipelines
This should be fairly obvious. Also, before you whine, notice I said production pipelines. There is nothing wrong with Data Scientists working through the data and starting to write code while they research and prototype a model. But you should have a data engineer ready and willing to take that initial code and productionize it.
Why shouldn’t you let Data Scientists write a production pipeline for their model? Because there is a 95% chance that the model will get a bad rap when every run brings errors, downtime, inconsistent results and a general lack of reliability. Why let a great model that has so many benefits languish simply because the architecture and underlying data movement are feeble?
You need a data engineer who is thinking about optimization, memory usage, CPU usage, threads, processes, logging, testing, data flow, and tooling. That’s what they do all day long, so let them do it.
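To make that list a little more concrete, here is a minimal sketch of the kind of plumbing an engineer tends to wrap around a pipeline step: logging, input validation, and retries. Everything in it is hypothetical and made up for illustration (the `with_retries` and `clean_rows` helpers, the `score` field, the logger name); it is not taken from any particular pipeline or library, and it uses only the Python standard library.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")  # hypothetical logger name


def with_retries(func, attempts=3, delay_seconds=0.0):
    """Call func(); log each failure and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            log.exception("attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def clean_rows(rows):
    """Keep rows whose 'score' field parses as a float; log how many survived."""
    kept = []
    for row in rows:
        try:
            kept.append({**row, "score": float(row["score"])})
        except (KeyError, TypeError, ValueError):
            # Missing, None, or non-numeric scores are dropped, not fatal.
            continue
    log.info("kept %d of %d rows", len(kept), len(rows))
    return kept
```

None of this makes the model any better, which is rather the point: it is the unglamorous work that keeps a good model from failing at 3 a.m. on one malformed row or one flaky network call.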
Why You Shouldn’t Let Data Engineers Write Machine Learning Models
This one is just as obvious as the other. Of course, at some point a data engineer should start to understand the types of models, which models are picked to solve which problems, and what type of computations are involved. Sure, many data engineers would be pretty decent at math and statistics. Does that mean they should be writing Machine Learning pipelines end to end? Probably not.
Most likely what you would end up with is the opposite of the Data Scientist problem above. There is a 95% chance you will end up with a beautiful system that is reliable, uses the latest technology, is logged well and has tests. The output, on the other hand, would most likely be questionable. If they put that amount of time into the architecture of the system and pipeline, you know where the time wasn’t put, right?
Having a Data Scientist who understands in great depth how to pick the proper model for the problem is key. Someone who questions every result, and knows the difference between throwing something at the wall until it sticks, and the truth.
Build a Team
If you want the best of both worlds, build a team of Data Scientists and Data Engineers. Let the engineers work on the architecture that will be used to ingest data, train the models, and serve them. They can spend time writing code to spin up clusters on GKE with a few lines of Python. Let them implement StackDriver or something else for logging. They would love to spend time working with the Scientists to understand the data needs and build a reliable and scalable pipeline to move and transform the data as needed.
Let the Data Scientists have time to dig into the problems, find the right solutions, tune their models, iterate fast, and engineer and select features. Let them all work together to solve problems as a group. Let them understand what the others are doing, and become masters of their craft with a healthy understanding of what is going on in their teammates’ world. Rarely is one person able to build the world; usually it takes a team.
In Conclusion
Again, I have never really understood the clash of the titans between Data Engineering and Data Science. Everyone should swallow their pride and just accept and pursue their passion and what they are good at. There is nothing wrong with people transitioning from one role to another if they want. Learning both Data Engineering and Data Science sounds like a great way to become a well-rounded person.
Most of the blame lies at the feet of the businesses and companies that have groups of Data Scientists and Engineers working in their own echo chambers without mixing it up. I don’t think there should be a Data Science team that doesn’t include a few engineers to lend a helping hand where needed.