5 Basic and Undervalued Data Engineering Skills
What is the standard for most data engineers these days? It turns out SQL and Python are still running the show pretty much across the board. Skill levels in those areas vary, of course, but with a little work and repetition it's pretty easy to master both SQL and Python. What I've found is that Python and SQL … or Java … or Scala … good development skills are really only half the battle. There are a few basic data engineering skills that come up over and over. They are simple, foundational skills that make an average data engineer better. They make a person more versatile, able to solve more complex problems, and able to work across a wide variety of tech stacks and cloud providers. What are they? Read on, my fair-weather friend.
Data Engineering Skill #1 – Linux bash and servers.
If there is one thing I've noticed that can be a challenge for some data engineers … it's being adept at maneuvering around bash, and more specifically, being able to work with and around servers … without RDP or a UI.
In the world of distributed systems, containers, Kubernetes, Docker, cloud providers … and the list goes on … to be the most effective data engineer possible you have to learn your way around a Linux bash shell. Being able to work from the command line to do common tasks, research, explore, and troubleshoot is going to be key now and in the future.
What are some bash operations you will probably find yourself doing often as a data engineer?
- `ssh` – being able to `ssh` into remote servers.
- generating `keys` and `secrets` – it could be `rsa` public and private keys, or even generating credentials for `aws` or `gcp`.
- `vim` – or some tool to that effect, for quickly inspecting files, code, and logs via the command line.
- `apt-get` – installing tools, maybe like the `aws cli` or `gsutil`; could be a program to `tar` or `zip` files, who knows.
The list goes on, but you get the point. Having some `bash` skills is a must-have for any data engineer. Being just as comfortable doing what you need to do on the command line as you are in your laptop GUI is pretty important.
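To make that concrete, here is a minimal sketch of the kind of command line session I mean. The host, user, key file, and log paths are all hypothetical; the commands themselves are the everyday ones.

```bash
# generate an rsa key pair to authenticate against a remote box
# (file name and comment are just examples)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/pipeline_box -C "data-eng"

# ssh into the remote server with that key (hypothetical host)
ssh -i ~/.ssh/pipeline_box etl@10.0.0.42

# once on the box: poke around logs without ever touching a UI
tail -n 100 /var/log/pipeline/etl.log          # last 100 lines of a log
grep -i "error" /var/log/pipeline/etl.log | less

# bundle files up to pull back down for a closer look, then log out
tar -czf logs.tar.gz /var/log/pipeline/
exit

# back on your laptop: copy the archive down over ssh
scp -i ~/.ssh/pipeline_box etl@10.0.0.42:logs.tar.gz .
```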
Data Engineering Skill #2 – Docker and container magic.
This is another basic skill that is amazingly lacking in certain data engineering circles. Docker and containers are one piece of technology that once you start using … you will never go back. Containers solve a few issues that can make data engineering difficult, and also make life easier in a few ways.
- ease of tool and dependency management across engineers (everyone is on a level playing field).
- makes testing multi-service environments easier (think `docker-compose`).
- easy to package code and send to production, development, etc. via CI/CD.
One of the worst experiences as a data engineer, or any developer, is to be dropped into a project with complex tooling and dependencies. An engineer shouldn't have to install Spark on their local laptop and get the exact versions and JARs perfectly aligned with the stars of the New Moon to be able to work on the project.
Running the tests shouldn't require more than one or two command line entries … and that is close to impossible without Docker. Most serious open source projects now provide official Docker Hub images (and if not, that tells you something), so the barrier to entry for developing pipelines with these tools has been lowered.
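As a sketch of what that looks like in practice — the service names (`app`, `postgres`) and compose file layout here are hypothetical, and assume the `app` image has pytest installed:

```bash
# build the project image and stand up its dependencies,
# assuming a docker-compose.yml that defines an app and a postgres service
docker-compose build
docker-compose up -d postgres

# run the test suite inside the container, then tear everything down
docker-compose run --rm app pytest tests/ -q
docker-compose down
```

That's the whole barrier to entry for a new engineer on the project: two or three commands, no local Spark or database installs.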
Simply put, containers are the future, and the future is here … time to get on the bandwagon if you haven't already.
Data Engineering Skill #3 – Unit testing.
This topic never ceases to amaze me. I've been asked in interviews before if I think pipelines should be tested … "we've thought about it but haven't really done anything yet." Honestly, no PR/MR should make its way into a `master` branch without unit tests. Data engineering can get a bad rap because some engineers think this is actually optional, when it's not. Unit testing data pipeline code provides the following benefits …
- unit tests protect you from yourself.
- writing tests while writing your code, or before, will make you write better and different code.
- unit tests will make it easier for other engineers to contribute new code or bug fixes.
- unit tests will make you modularize your data pipeline code more.
The list of benefits for unit testing data pipeline code could go on and on, but I will stop there. Tests will not catch every bug, but many times they catch the silly mistakes we make. Usually testing requires mocking and pulling production data samples to use as fixtures … which is probably one of the most eye-opening and beneficial exercises you can go through before rolling code to production.
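Here is a minimal sketch of that workflow from the command line, assuming a Python project with a pytest suite under `tests/` — the bucket, file, and test names are all hypothetical:

```bash
# pull a small sample of production data to use as a test fixture
# (hypothetical bucket and paths; sample and anonymize as appropriate)
aws s3 cp s3://prod-bucket/events/sample.json tests/fixtures/sample.json

# run the whole suite quietly, stopping at the first failure
pytest tests/ -x -q

# or run just the tests that exercise the new transform
pytest tests/test_transforms.py -k "parse_events" -q
```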
I think the hesitation and unwillingness to adopt unit testing among many data engineers comes from not having a strict software engineering background where this practice is normalized; many times we have an aversion to things we don't understand or are not familiar with. But never fear, it's not that hard, and once you start, you won't look back.
Data Engineering Skill #4 – Documentation.
People might label this as being picky, but I don't think so. No one usually thinks about documentation until it's too late, if at all. I think the problem here is that people are all over the map when it comes to documentation, with strong opinions and no accepted best practice.
I've been told by principal and lead engineers to "delete the comment because I can just read the code" … well, ok, but I guess we could say that about everything. Is the architecture self-documenting? The business logic? The reasons certain decisions were made? It's probably not a stretch to say that 5 sentences of documentation in a `README` could save someone an hour's worth of time reading code (a sketch of what I mean comes at the end of this section).
- documentation helps – especially with understanding high-level data flow.
- documentation can shed light on tricky configuration or interdependencies that are not obvious.
- documentation can help with understanding CI/CD flow, development, testing, and production usage.
- documentation can help with high-level architecture and technology stack understanding.
Sure, you probably don't need to comment every method or function that you write. But is it good to throw a few sentences on top of a new `class` you are writing, to let future engineers know what they are dealing with? Of course. Are there strange things about the data and the data pipeline that aren't obvious from the code? Document them.
I guess you could just write code and tell everyone to read it and they will be fine, but that seems pretty short-sighted … even for yourself when it breaks 3 months later.
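For what it's worth, here is a sketch of the kind of five-sentence README I mean. The project name, flow, and details are purely illustrative; the point is the shape, not the contents.

```bash
# drop a short README next to the pipeline code (contents are hypothetical)
cat > README.md <<'EOF'
# events-pipeline
What: nightly job that cleans raw click events and loads them to the warehouse.
Flow: raw files -> validation -> dedup -> aggregate -> warehouse table.
Gotchas: upstream files can arrive late; the job is safe to re-run for a day.
Run: see docker-compose.yml; tests live under tests/.
Owner: data engineering team.
EOF
```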
Data Engineering Skill #5 – Architecture and Data Modeling.
Why do these two go together? Because in the end data engineering and coming up with an “architecture” has a lot, if not all, to do with the type, size, and format of the data coming in and out of the systems we design.
Everyone loves to just jump in the code, we are engineers, we solve problems … dive in head first and worry about the details later. Yeah … sometimes that works out, most of the time it doesn’t.
There are many reasons why architecture and data modeling should become a larger part of your data engineering journey.
- data modeling helps you understand in a concrete way what you are dealing with.
- architecture helps you understand end-to-end data flows.
- architecture helps you spot holes and problems.
- data modeling will drive you toward (and away from) certain tech stack choices (think Spark vs Kafka, etc.).
Not enough data engineers spend time in this arena. They just pick a tool because they are using a certain cloud provider and that's "what I've done in the past." If we take time to think about what we are doing beforehand, we will come up with simpler, cleaner, and cheaper solutions to the same problems.
Everyone is used to data modeling inside relational databases, but not enough engineers are versed in modeling data in cloud storage … or Data Lakes … which are the new data warehouse. There is a lack of basic understanding about data partitioning, for example – a fundamental problem that should be addressed when data modeling any new system.
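As a quick sketch of what Hive-style partitioning looks like in cloud storage — the bucket, table, and date values below are hypothetical:

```bash
# write a file into a date-partitioned "folder" structure;
# engines like Spark or Athena can then prune partitions by date
# instead of scanning the whole table
aws s3 cp events.parquet \
  "s3://my-data-lake/events/ingest_date=2021-01-15/events.parquet"

# listing one partition touches only that day's data
aws s3 ls "s3://my-data-lake/events/ingest_date=2021-01-15/"
```

Deciding up front what you partition by (and how granular) is exactly the kind of data modeling decision that makes or breaks the pipelines built on top of it.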
Sure, architecture and data modeling might not be as fun as some hot new programming language or approach, but they are tools that should be in every data engineer's tool belt to help build the next generation of highly scalable pipelines.
Musings
Those are the top 5 skills I see lacking in the data engineering community. Of course there are many people who practice them like a religion, but not enough in my view. Data engineers can get looked down upon by others in the software world, and the lack of, or unwillingness to learn, some of these basic skills is a reason for that.
The `bash` command line is important; everything runs on a server in the cloud these days … you have to know how to get around those machines.
Containers are here to stay. They will make your life easier by putting everything for your project in a box that can be run and used by anyone, or by any new tool out there today.
If your pipeline code isn't unit tested, honestly, you need to ask yourself a few questions and grow as an engineer. It just isn't acceptable to publish untested code.
Maybe you're a child genius, but most people aren't, so write documentation that points out the intent and the things that aren't obvious.
Most important of all, the next time you start to write a data pipeline, give yourself 1 hour of architecture time to think about tools and data models before you write a single line of code or make any decisions. You might be surprised at what happens.
I would say modeling and architecture should be the number 1 skill. It forms the basis of all subsequent pipelines that will be built.