The 3 Types of Data Engineers, Which One Are You?
Every good story starts with a few different characters, right? It’s like the spice of life: a little bit of this, a little bit of that. It’s the way of the world. In all my data wandering I’ve come across lots of different types of data engineers. I can usually put them into three categories, somewhat similar but in many ways quite different.
Why 3 Types of Data Engineers?
I really think the different categories of data engineers come down to skill sets. I also think these skill sets are driven by the marketplace, collectively driven by all the companies hiring data engineers and putting them to work on certain projects. It probably has even more to do with the different categories of data inside organizations.
Also, I believe that the trajectory of a data engineer’s career “usually” flows along this path to some extent. I don’t think that one category is inherently better than another. In fact it’s probably true that lacking skills in some of the categories can be a serious impediment for many data engineers, both in the quality and design of data pipelines and in career development.
The Categories of Data Engineers
Category 1 >> the Database/Data Warehouse and Analytics/Metrics/Dashboard engineer
Category 2 >> Category 1 + Python + Airflow/dbt (etc)
Category 3 >> Category 2 + Scala/Java + Distributed Systems + Architecture + Big Data + ML
There you have it. Which one are you?
Category 1 – Data Engineer
Category 1 is where most, but not all, people start their data engineering journey. I would argue that it’s probably the most important when it comes to being a well-rounded data engineer.
The Category 1 data engineer cuts their teeth on traditional relational databases and data warehousing techniques. They know all about schemas, constraints, indexing, query tuning and performance, data quality, data management, data warehousing.
They are good at classic ETL and know all about Facts and Dimensions. They understand data quality and management … data governance.
These are skills that are essential to data engineering. When a Category 3 engineer is missing these skills, they might be able to build a pipeline that scales to hundreds of terabytes of data, but the end result will often be poor and never get used. Being able to model data correctly and understanding consumption and reporting needs are often the hinge on which a project will stand or fall.
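To make the Facts and Dimensions idea a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are all made up for illustration; a real warehouse would have far more going on, but the shape is the same: a fact table of measurements, dimension tables describing them, constraints to protect quality, and an index to help the reporting query.

```python
import sqlite3

# Hypothetical star schema: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL,
    region        TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,      -- e.g. 20240115
    calendar_date TEXT NOT NULL UNIQUE,
    month         TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    amount       REAL NOT NULL CHECK (amount >= 0)
);
-- Index the column the reporting queries join and filter on.
CREATE INDEX ix_fact_sales_date ON fact_sales(date_key);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", "North"), (2, "Globex", "South")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20240115, "2024-01-15", "2024-01"), (20240220, "2024-02-20", "2024-02")],
)
conn.executemany(
    "INSERT INTO fact_sales (customer_key, date_key, amount) VALUES (?, ?, ?)",
    [(1, 20240115, 100.0), (2, 20240115, 50.0), (1, 20240220, 25.0)],
)


def sales_by_region_and_month(connection):
    """Join the fact table to its dimensions and aggregate for a report."""
    query = """
        SELECT c.region, d.month, SUM(f.amount) AS total
        FROM fact_sales AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY c.region, d.month
        ORDER BY c.region, d.month
    """
    return connection.execute(query).fetchall()


report = sales_by_region_and_month(conn)
for region, month, total in report:
    print(region, month, total)
```

The point is less the SQL itself and more the habit: the question the business will ask (sales by region and month) drives how the data is modeled.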
Usually these engineers are tasked with producing Dashboards of all shapes and sizes … Tableau, Looker, SSRS, you name it. They are skilled at requirements, talking with end users, and generally providing the answers from that data that the business needs. A very valuable skill indeed.
Companies that drive Category 1 engineers.
Like I mentioned before, all this is driven by the needs of the business and the tech stack they use. Many Category 1 engineers work in the traditional data warehousing world where monolithic SQL Servers are used to aggregate and report data. The focus of these is usually along the lines of flat file >> RDBMS >> reporting tool >> start that loop all over again.
If they live in the cloud, it’s the RDS cloud; you can read more about them here.
They are usually slower to adopt newer technologies and prefer the safe, known, and reliable to jumping ahead.
Category 2 – Data Engineer
Many times the Category 1 engineer slowly wanders into Category 2 and wakes up a few years later without realizing it.
It usually starts when a Category 1 engineer is tasked with pulling some data from an API, and then it all begins …. the Python. Category 2 engineers start to spend more of their time writing code than writing SQL like Category 1.
I would venture a guess this is where the majority of data engineers fall. They become experts in Python and its libraries …. pandas, numpy, even wandering into PySpark with tools like Glue. They can write beautiful OOP classes, and they are adept at working with unstructured and structured data from any source.
When you’re good at Python the world is your playground.
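As a rough illustration of that first "pull some data from an API" task, here is a small sketch using only the standard library. The response body is a hard-coded stand-in for what an HTTP call would return, and the payload shape and field names are invented for the example; real APIs will differ.

```python
import json

# Stand-in for the body of an HTTP response. The shape is hypothetical.
SAMPLE_RESPONSE = """
{"results": [
  {"id": 1, "user": {"name": "Ada", "country": "UK"}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "Linus"}, "tags": []}
]}
"""


def flatten_record(record):
    """Turn one nested API record into a flat, table-friendly row."""
    user = record.get("user", {})
    return {
        "id": record["id"],
        "user_name": user.get("name"),
        "user_country": user.get("country"),  # None when the API omits it
        "tag_count": len(record.get("tags", [])),
    }


def parse_response(body):
    """Parse the JSON body and flatten every record under 'results'."""
    payload = json.loads(body)
    return [flatten_record(record) for record in payload["results"]]


rows = parse_response(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

Handling the missing "country" field on purpose is the kind of detail that separates a script that works once from a pipeline that keeps working.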
These engineers also start to get good at DevOps, and they are good at building end to end pipelines. Orchestration is important to them, whether it’s Airflow or dbt or whatever. They focus on building automated end to end systems that move and transform data.
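The core idea behind orchestration, running tasks in dependency order, can be sketched in a few lines. To be clear, this is not Airflow's or dbt's API; it is a toy runner built on Python's standard graphlib module (Python 3.9+), and the task names and data are hypothetical.

```python
from graphlib import TopologicalSorter


# Three toy tasks that share state through a plain dict.
def extract(context):
    context["raw"] = [" 3", "1 ", "2"]


def transform(context):
    context["clean"] = sorted(int(value.strip()) for value in context["raw"])


def load(context):
    context["loaded"] = len(context["clean"])


TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it runs.
DEPENDENCIES = {"transform": {"extract"}, "load": {"transform"}}


def run_pipeline(tasks, dependencies):
    """Run every task once, upstream tasks first."""
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](context)
    return order, context


order, context = run_pipeline(TASKS, DEPENDENCIES)
print(order)
print(context["clean"], context["loaded"])
```

Real orchestrators add scheduling, retries, logging, and alerting on top of this, which is exactly why tools like Airflow exist.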
Companies that drive Category 2 engineers.
I would say a lot of the companies that drive the usage and demand for Category 2 engineers are startups and mid-sized companies that are pushing hard to move their tech stack forward and become data driven decision makers.
They care about DevOps and automation. They usually have terabytes of data and, out of necessity, can no longer deal with the types and size of their data in the traditional manner. They also play in the Machine Learning space, able to write pipelines that produce ML models and features in production without interruption … not an easy feat.
Category 3 – Data Engineer
It usually culminates in Category 3 engineers. These are the ones pushing the bounds of what’s possible.
In this world you have to be an expert in distributed systems … Kubernetes, Spark, Pulsar, etc. The size and complexity of the data and computations can only be handled at scale. Many times this includes wandering out of Python and learning Java/Scala … out of necessity to debug strange Java errors on some cluster.
Many times you can consider these engineers to be architects. They have senior level coding skills and they know the challenges of processing big data on clusters.
Knowing how to stitch together and orchestrate complicated pipelines that work at the terabyte/petabyte level is something they enjoy and excel at. They are usually experts at every level of a pipeline, from the storage layer to the compute.
The Category 3 engineer is no stranger to the world of MLOps, probably one of the toughest problems being solved today.
Musings
Which one are you? Maybe all of the above? I don’t think one category is better than the other per se. I’ve seen my fair share of Category 3 engineers who skipped Category 1 and lack the very foundational skills to be able to actually produce usable data … just a huge pipeline producing data for themselves and no one else.
I think a Data Engineering team with persons that excel in each category is the dream team. Being able to model data and think about quality and data warehousing is absolutely essential. Add to the mix some awesome Python and distributed systems knowledge and there is little that cannot be accomplished.
This article makes so much sense to me. I can very well relate to this. I started out as a data analyst (building dashboards and SSRS reports) and slowly moved towards more technical projects (automating file exports, basic ETL jobs, database deduplication, all of these using Python). It has been a few months since I became a data engineer. I’m still a novice and trying to learn a lot, but it has been an exciting journey. I thought this was only my journey, and that data engineers are usually software engineers who already possess the data engineering skills from their college education (I have an engineering background but not in computer science). From what you say here, it looks like there are many that start out as data analysts. Glad that I’m not the only one! Thanks for this article.
Hi Daniel,
I am also a Data engineer and I really liked the article, especially how you categorized data engineering and explained what is it about every category and what are common steps when progressing in this field. I confirm all of the above as it was same with my data engineering path, going from ETL development and data warehousing to using python for working with all kinds of data sources, distributed systems and cloud platforms.
Thank you for the article and best regards!
Hi Daniel,
Great Article and definitely learned a lot from you!
Thank you