8 Data Engineering Best Practices
Best practices are always a touchy subject, and I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, in keeping things simple. I also subscribe to the 80/20 rule, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well will, in the long run, solve most of the major problems you encounter in data teams and architectures. Today I want to give you 8 Data Engineering best practices, to hopefully give you at least some food for thought.
Best Practices
Well, here goes nothing … or everything, depending on your perspective.
- Unit Tests
- End-to-End Testing
- Documentation / README
- Use an IDE
- Idempotent Pipelines
- Docker and Docker-compose
- Delete Code
- Think before Write
We are going to start our journey to being better Data Engineers by exploring some of the more basic best practices. These first few are the foundation upon which great empires … or architectures are built. It’s extremely hard to go to “the next level” with your skills, your pipelines, and your team … if you aren’t embracing the ideas that are commonly accepted throughout the software world.
Unit Testing
Unit testing in the Data Engineering world is like Bigfoot: talked about a lot, but rarely seen out in the wild. People talk the big talk, but most data teams have yet to unit test all their data transformations. This is a baseline; unit testing has so many benefits that make us grow as engineers and data practitioners.
- Unit testing catches obvious bugs prior to production.
- Unit testing forces us to write less complex functions/transformations.
- Unit testing makes us write clean code.
- Unit testing helps us write functions that are more modular and reusable.
- Unit testing streamlines our development process.
- Unit testing helps new engineers get into and contribute to the codebase with confidence.
That’s just to name an obvious few. There is something different about the culture of a data team that unit tests versus one that does not. The ones that do not are typically just messy places to work: probably no development or integration environments, no documentation, bad codebases to work in, overly complex, and a total disregard for engineering processes and best practices. Chaos reigns.
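To make it concrete, here is a minimal sketch of what this looks like, assuming pytest and a made-up little transformation (the function and column names are purely illustrative):

```python
# test_transforms.py -- a minimal, hypothetical example of unit testing
# a data transformation with pytest. Function and column names are made up.

def add_full_name(records: list[dict]) -> list[dict]:
    """Tiny transformation: derive full_name from first/last name."""
    return [
        {**r, "full_name": f"{r['first_name']} {r['last_name']}".strip()}
        for r in records
    ]

def test_add_full_name_builds_expected_column():
    rows = [{"first_name": "Ada", "last_name": "Lovelace"}]
    assert add_full_name(rows)[0]["full_name"] == "Ada Lovelace"

def test_add_full_name_handles_empty_input():
    assert add_full_name([]) == []
```

Nothing fancy, but once every transformation is a small, pure function like that, the rest of these best practices get a whole lot easier.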
End-to-End Testing
Next up, once you have passed the unit testing … test, are end-to-end and integration tests. While unit tests are a great baseline with all sorts of benefits, data teams are usually dealing with very complex data flows and interdependencies. This is where the devil usually hides: in the details.
Truly end-to-end integration tests are life changing. They catch the little buggers that pop up at the connection points between complex transformations and pipelines, spots that unit tests have difficulty covering. The next best thing about end-to-end integration testing is that it forces you to have the architecture and environments to support this kind of work, and a codebase that can handle it, which says something about the code.
- Integration tests catch what unit tests cannot.
- Integration tests can test pipeline inter-dependencies.
- Integration tests in a prod like environment are your last line of defense.
- Integration tests force you to get better at creating environments and architecture for your data stack.
- Integration tests ensure your codebase is good … aka it can handle it.
Don’t be a stinker; make end-to-end testing of your pipelines button-click easy.
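What does “button-click easy” look like? Roughly the sketch below, which assumes a local Postgres (spun up with docker-compose, more on that later), the psycopg2 driver, and a hypothetical run_pipeline() entry point; every table and connection detail here is made up.

```python
# test_end_to_end.py -- a rough sketch of an end-to-end pipeline test.
# Assumes a local Postgres is already running (e.g. via docker-compose)
# and that the project exposes a run_pipeline() entry point; the
# connection details, fixture file, and table names are hypothetical.
import psycopg2

from my_pipeline import run_pipeline  # hypothetical entry point

def test_pipeline_loads_expected_rows():
    conn = psycopg2.connect(
        host="localhost", dbname="warehouse", user="test", password="test"
    )
    try:
        # Run the entire pipeline against the local environment.
        run_pipeline(source_path="tests/fixtures/orders.csv", conn=conn)

        # Then check the final landing spot, not an intermediate step.
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM analytics.orders;")
            row_count = cur.fetchone()[0]

        # The fixture file holds 100 known-good records.
        assert row_count == 100
    finally:
        conn.close()
```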
Documentation / README
I’m surprised I even have to bring up this topic, but alas, that is the reality. It’s amazing the difference a little documentation can make, even if it’s half wrong, to a new engineer starting out, or someone new on a project. It gives context that is usually buried inside some grumpy and overworked person’s head.
I mean, even a little README can give great instructions on things like …
- How to run tests on the codebase.
- General description of what is actually happening in the codebase.
- Gotchas and assumptions made during project creation.
- A picture or chart of data flows.
- Descriptions of data sources and sinks.
There is no end to the benefits of some documentation or READMEs attached to your codebase and pipelines. Even a paragraph or a few sentences can shed light and save lots of heartache and frustration. Plus, when you write documentation for the code you are producing, you might actually get the benefit of realizing something is too complex, or not clear, and the documentation might drive you to change your code! Who would have thought.
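If you need a starting point, even a bare-bones skeleton like this one (the headings are just suggestions, obviously) beats an empty repo:

```
# my-pipeline  (example skeleton -- adjust for your own project)

## What this does
One or two sentences on what data comes in, what goes out, and why.

## Data sources and sinks
- Source: ...
- Sink: ...

## How to run the tests
docker-compose up test   (or whatever your project uses)

## Gotchas and assumptions
- ...
```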
Use an IDE
I know you probably like your Notebooks, or maybe your Notepad++, Sublime Text, or whatever. These are no replacement for the full suite of tools found in the great IDEs … like PyCharm, IntelliJ, VS Code, and the like. I mean, if anything, you want to be like the cool people, so use an IDE. Honestly, a good IDE will just make a good software engineer better, extend their capabilities, and make them more efficient.
From personal experience, I enjoy how my IDEs of choice, PyCharm and Atom, push me towards catching errors while I’m writing code. Maybe the input or output doesn’t match the type in the function. There are many things that can go wrong when writing code; an IDE helps you and pushes you down the correct path. And with the multitude of plugins available, it’s a no-brainer. Use one. Get better.
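For example, with plain type hints in place (a toy example, nothing more), PyCharm or VS Code will underline a mismatched argument before you ever run the pipeline:

```python
# A toy example: with type hints, a good IDE will warn you when the
# inputs and outputs of your functions don't line up.

def parse_amount(raw: str) -> float:
    """Turn a raw string like '12.50' into a float."""
    return float(raw.strip())

def total(amounts: list[float]) -> float:
    return sum(amounts)

# This call lines up with the hints above. If you accidentally passed
# total() a single string, or parse_amount() a float, the IDE would
# underline it while you type instead of letting it blow up at runtime.
print(total([parse_amount("12.50"), parse_amount("7.25")]))
```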
Idempotent Pipelines
The same thing can be very boring, but the same thing can also save you many headaches and heartbreaks. Probably one of the most fundamental ideas for data pipelines is idempotency. It’s simply the idea that …
“Running a data pipeline once, or many times, will produce the same result.”
– who knows
The simplest way to break this rule is a data pipeline that only does an INSERT INTO some table every run, rather than something like a MERGE. Meaning, if you run the pipeline twice, it produces duplicates or just simply breaks. When starting out as a Data Engineer it’s easy not to think about a topic like idempotency, and it can be somewhat esoteric depending on the context, but it is utterly important to the building of data pipelines.
- Data pipelines should be able to be run repeatedly, without any ill side effects.
- Idempotency needs to be built into data pipelines upon creation, not after the fact.
- There is nothing more annoying and time-consuming than a non-idempotent pipeline.
- Using simple features like MERGE INTO statements will make most pipelines idempotent (see the sketch below).
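Here’s a rough sketch of the difference, with made-up table and column names; the exact MERGE syntax varies a bit by engine (Snowflake, BigQuery, Delta Lake, Postgres 15+, and so on):

```python
# A rough sketch of an idempotent load step. Table, column, and
# connection details are made up; adjust the MERGE syntax for your engine.

# Non-idempotent: run this twice and you get every row twice.
INSERT_ONLY = "INSERT INTO analytics.orders SELECT * FROM staging.orders;"

# Idempotent: keyed on order_id, so re-runs update instead of duplicating.
MERGE_UPSERT = """
MERGE INTO analytics.orders AS tgt
USING staging.orders AS src
    ON tgt.order_id = src.order_id
WHEN MATCHED THEN
    UPDATE SET amount = src.amount, updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, amount, updated_at)
    VALUES (src.order_id, src.amount, src.updated_at);
"""

def load_orders(conn) -> None:
    """Run the idempotent version -- safe to re-run after a failure."""
    with conn.cursor() as cur:
        cur.execute(MERGE_UPSERT)
    conn.commit()
```

Run load_orders() once or ten times and analytics.orders ends up in the same state either way.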
Docker and Docker-compose
Can you think of anything more annoying than cloning some codebase and then not being able to run the code or test it? I can. Having to sit there on your local machine and install every little requirement, download files, mess, mess, mess, to finally just run the tests. Don’t do that. Containers have changed the world, and Data Engineers need to get on that train. The very first item on the list when creating a new project or pipeline should be creating a Dockerfile that can house that project in its entirety.
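Even something this small is a fine start (a sketch only; the base image, file names, and entry point are placeholders for whatever your project actually uses):

```dockerfile
# Dockerfile -- a minimal sketch; base image, files, and entry point
# are placeholders for whatever your project actually uses.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Then bring in the pipeline code itself.
COPY . .

CMD ["python", "-m", "my_pipeline"]
```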
Need the ability to run tests? How about a little docker-compose up test? Need a local Postgres instance running for your app or pipeline to work? It’s called docker compose, my friend. I find it hard to take any project seriously, even open source ones, if it doesn’t have an officially supported Dockerfile available for use. See what that buys you.
- All your requirements can be included in a Dockerfile, no more worrying about someone’s OS of choice.
- Docker compose enables you to automate simple tasks like running tests.
- Docker compose enables you to make a complete working local environment for development and testing, including the entire architecture of most pipelines (see the sketch below).
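As a sketch, the docker-compose.yml to go with that Dockerfile might look something like this; the service names, image tag, and test command are assumptions to swap for your own:

```yaml
# docker-compose.yml -- a sketch; service names, image tags, and the
# test command are assumptions to swap for your own project's setup.
version: "3.8"

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: warehouse
    ports:
      - "5432:5432"

  test:
    build: .
    command: pytest tests/
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgresql://test:test@postgres:5432/warehouse
```

With that in place, docker-compose up test is all a new engineer needs to type to run the whole suite against a real Postgres.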
Delete Code
I’m fairly sure at least 80% of you reading this didn’t expect this topic to show up in a list of best practices. Being able to “delete code” is such a general statement, but I included it for a reason. This is a skill. It’s a skill to be succinct when writing code, to not be long-winded. If you are able to delete code, that probably means you can write short and to-the-point data pipelines and transformations that don’t have a lot of fluff.
Most folks probably think about being able to commit huge PRs with all sorts of code and huge codebases. This might be a skill, I suppose, but I’ve learned that lots of code isn’t necessarily a good thing. Many times there is a lot of code because someone didn’t see something obvious, because they didn’t take the time to think through the problem and find the easy solution.
- Code written has to be maintained.
- There are bugs in all code written.
- Deleting code is a better contribution than writing more code.
- Being good at deleting code makes you better at writing it in the first place.
I challenge you to go into whatever codebase you’re working on this week … and find something to delete.
Think before Write
Maybe I will make this one short and sweet. I bet there are things you can learn from The Grug Brain Developer, one of them being to think before you write. I’m fairly certain that the next time you start a Data Engineering project, if you pick someone out, say another engineer, and talk about the topic for 30 minutes … you will magically save yourself 3 days of coding.
There are a lot of skills tied up in “think before write.” It requires you to think about the data coming in, the data coming out, and what’s happening to the data in between. It makes you think about the requirements … did this person really mean this, what did they mean by that, and so on. As type-happy engineers we can easily just plunk down at our proverbial desks and start hacking away, writing code till our fingers fall off. That’s the easy part. The hard part is to stop and consider what we are doing.
- Thinking first will save you days on the back end.
- Most developers don’t think first, therefore when you do, you will be better than most.