Data Engineering – A Day in the Life.
Yeah, data engineering seems to be a hot topic today, as much as data science is/was 3 years ago. What does a data engineer do, and what skills do you need? Peruse the job postings and it will quickly become overwhelming. Spark, Hadoop, SQL, Python, Scala, ETL, Data Warehousing, various Data Sciencey Things, Streaming, Analytics, Business Intelligence, Machine Learning, blah blah blah. What are the top skills needed to be a successful Data Engineer? What does the average day look like for a Data Engineer? Here is my two cents; it’s probably worth what you paid to read this.
As someone who spends a decent amount of time following and reading what’s happening in the data engineering world, it’s become very apparent to me that what you do as a data engineer mostly depends on where you work. It’s usually about the tool-set at a given company and where they are in the data landscape. Some companies are farther along than others: some are just starting to build data warehouses and lakes, others have moved past that to “big data”, machine learning, and data science. But even with these differences there do seem to be a few common denominators that I see popping up more often than not across the data engineering landscape when reading about others’ experiences. Let’s talk about those. These are the things I consider data engineering 101, the basics, from my point of view.
- SQL
It’s amazing to me how much has changed with NoSQL and other distributed data processing engines, but one thing remains the same. SQL is everywhere. Who cares if it’s MySQL, PostgreSQL, SQL Server, or Oracle? Learn one, learn them all. To be a good data engineer, at some point in your career you should learn to be above average at SQL. Many data sets today live inside RDBMS databases, and a good data engineer needs to be able to pull out that data in an efficient manner.
A basic understanding of how indexing, joins, and sub-queries work, plus the ability to read a query plan and spot bottlenecks and problems, is really a basic requirement. Being able to write a simple SELECT statement just isn’t good enough. Most relational data requires a lot of relational work, and many times the business logic involved in pulling out useful information is complex and needs to be written in such a way that it is performant and correct. Being able to tell a good data model from a bad one is key. Anyone inside an organization should be able to look to the data engineer as the Gandalf of the database, the query and general SQL wizard for all hard questions and problems.
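To make the query plan point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, the `idx_orders_customer` index, and the `query_plan` helper are made up for illustration, and SQLite only stands in for whatever RDBMS you actually use. The exact wording of the plan varies between SQLite versions, but the shift from a full scan to an index search should be visible.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Hypothetical table, loaded with some throwaway rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With the index in place, the same query can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Other databases expose the same idea through their own `EXPLAIN` variants; the habit of checking the plan before and after a change is what carries over.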
A data engineer should also at least be a crappy DBA. I mean that they will probably never be as good as the tenured DBA who’s been in a closet with his database of choice for the last 15 years, but they should be able to understand enough to help with troubleshooting database issues on any platform. They need to understand logs, where data files are stored, various configurations and how they can affect things.
- Code
The data engineer must be able to write code of some sort. Again, this probably depends on where you work and your background. Some data engineers come from the software engineering world, some don’t; the right answer is somewhere in between. The problem with being too focused on software engineering is that in many cases the data engineer is bridging gaps between the data science and business worlds, so perfect code can still produce results and data that are of no use to anyone else. The problem with not being disciplined with software engineering techniques is that data processing pipelines end up prone to error, not well tested, and generally unreliable.
There has been a rise of Java and Scala in the big data engineering world, but I would suggest that Python is probably the best idea, at least starting out. How good do you need to be at Python? You probably need to know how to process CSV files, know what generators are, and know something along the lines of pyodbc, plus probably a little json and requests. Building data pipelines, aka moving data from one location to another, should be a relatively straightforward task for the data engineer. I would suggest Michael Kennedy’s classes and podcast if you want to get better at Python. Understanding tuples, sets, lists, dictionaries, etc. will make your life easier.
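As a rough idea of the level I mean, here is a small sketch of a generator-based pipeline that reads CSV rows, cleans them, and emits JSON lines using only the standard library. The column names (`customer`, `amount`), the sample data, and the function names are invented for illustration; a real pipeline would read from a file or a database cursor instead of an in-memory string.

```python
import csv
import io
import json


def read_rows(handle):
    """Yield one dict per CSV row without loading the whole file."""
    yield from csv.DictReader(handle)


def clean(rows):
    """Skip rows with an empty amount and cast the rest to float."""
    for row in rows:
        if row["amount"].strip():
            yield {
                "customer": row["customer"].strip(),
                "amount": float(row["amount"]),
            }


def to_json_lines(rows):
    """Serialize each cleaned row as one JSON document per line."""
    for row in rows:
        yield json.dumps(row, sort_keys=True)


# Hypothetical sample data standing in for a real CSV file.
SAMPLE = "customer,amount\nalice,10.5\nbob,\ncarol,3\n"

source = io.StringIO(SAMPLE)
lines = list(to_json_lines(clean(read_rows(source))))
for line in lines:
    print(line)
```

Because each stage is a generator, rows flow through one at a time, so the same code works on a file far larger than memory.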
With the rise of distributed engines like Spark and Dask, one with a Python API and the other built on Python, it would probably be a good idea to at least explore these systems and try to understand them. Of course learning the basics of Hadoop/HDFS is probably a good idea too. These technologies give you good insight into the other end of the spectrum from RDBMS.
- Business Intelligence
Some people would disagree, but I think BI is an integral part of being a good data engineer. Typically, understanding the different methods of storing and organizing data, which data to store, etc., is about knowing how the data will be consumed and what type of information is needed. I would call this business intelligence. Is the data relational in nature? Would it benefit from a Kimball style data warehouse, or maybe columnar storage in S3? Being able to understand the options and which one best fits the use case will avoid wasted work and business units who are frustrated with IT not getting it.
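For readers who have not seen a Kimball style layout, here is a tiny sketch of a star schema: one fact table surrounded by dimension tables, queried with joins and a GROUP BY. The table names, columns, and rows are all made up, and in-memory SQLite stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: two dimensions and one fact table.
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        amount REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'),
        (2, 'Gadget', 'Hardware'),
        (3, 'Manual', 'Books');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 2, 20.0),
        (2, 20240101, 1, 15.0),
        (3, 20240201, 4, 40.0),
        (1, 20240201, 1, 10.0);
    """
)

# A typical BI question: revenue by product category and month.
sales_by_category = conn.execute(
    """
    SELECT p.category, d.month, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
    """
).fetchall()

for row in sales_by_category:
    print(row)
```

The fact table stays narrow and numeric while the descriptive attributes live in the dimensions, which is what makes this kind of slicing cheap to express.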
You would be surprised how many people in the BI world are expected to install, configure, and administer their own BI environments. This means you need to understand the basics of clusters of computers working together and be comfortable on the command line of both Windows and Linux based machines. Knowing how to open and close ports, work with firewalls, and generally be comfortable moving around servers will only make you a better data engineer who can build and maintain workloads and data pipelines across a variety of systems. Not every place you work will have a System Admin who will install systems for you and answer questions about network communication.
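One small example of that kind of troubleshooting is checking whether a port is actually reachable before blaming the database or the pipeline. The sketch below uses only the standard `socket` module; the `port_is_open` helper is my own name for illustration, and the listener is a throwaway local socket so the example runs without any real server.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Throwaway listener on a free local port, standing in for a real service.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

open_while_listening = port_is_open("127.0.0.1", port)
listener.close()
open_after_close = port_is_open("127.0.0.1", port)

print(open_while_listening, open_after_close)
```

A failed check does not say whether a firewall, a stopped service, or a wrong hostname is at fault, but it narrows down where to look next.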
Lastly, a good data engineer should probably be able to put basic reporting together when needed. The ability to understand what a dashboard is and how to use Tableau, or whatever, will only ensure you are better prepared to store and organize the data, because you realize the importance of how applications consume data structures, and how that in turn affects what you do.
In short, I think the data engineer has to walk in a lot of worlds. Being technical enough to work with and alongside software engineers will only make the data engineer smarter, and able to build scalable, reliable data pipelines and workloads. Being business savvy enough to understand business requirements and translate them into the right technology selection and implementation, one that can produce usable results in a reasonable amount of time, makes data engineering valuable to an organization.
Many days I can end up feeling like the picture in the intro, trying to juggle way too many technologies for different people with different wants and needs. But that’s why I love being a data engineer. I never know what I will be doing tomorrow.