A Piece of DevOps that most Data Engineers Ignore.
I am always amused by the apparently contradictory nature of working in the world of data. There are always bits and pieces that come and go, the popular and the out of style … new technology driving new approaches and practices. One of the hot topics the last decade has produced is DevOps, now a staple of pretty much every tech department. Like with most every other newish Software Engineering methodology, the data world has struggled to adopt and keep pace with DevOps best practices. This one is always a thorn in my side, making my life more difficult. The simplicity with which it can be adopted is amazing, and the unwillingness and lack of adoption is strange.
Dockerfiles for Data Engineers
The lack of a Dockerfile in any data pipeline or repo I explore tells me everything I need to know about the quality and setup of the codebase. Most folks in the data world live their lives without one, thinking that containerization is for the software engineers of the world, but this is not the case. If anything, the Data Engineering and Data Science worlds have more of a use case for Dockerfiles than most.
Why data needs Dockerfiles
It’s pretty common today for most Data Engineering/Data Science/ML workloads to be Python heavy. What’s the best and worst part about PyPI and Python packages? They are incredibly finicky, break easily, cause requirement conflicts, and require a large amount of magic to not break over time.
What else is common for data workloads and pipelines? Relational databases and the connections that go with them. Anything else? An amazing number of command line tools.
Could there be more reasons? I'm glad you asked; yes, there are. Typical complex data pipelines and codebases require environment variables, configuration, specific directories, and code layout.
This is what a Dockerfile is for. Why not make life easier for yourself and others? With a simple docker run or docker-compose up command, everything that is needed to run and test pipeline code is at your fingertips. All the setup complexity is written once and hidden away, rarely to be messed with again.
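For example, something as simple as the following is enough to run a pipeline or bring up its whole stack (the image name here is hypothetical, just to illustrate the idea) …
docker run --rm my-pipeline-image
docker-compose up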
Reasons to use a Dockerfile for data pipeline(s)
- no surprise updates and breakages due to OS or package updates.
- easier onboarding of new engineers into the codebase.
- requirements, configuration, env vars all become easier to manage.
- everyone is on the same page, no Windows vs Mac vs Linux gotchas.
- easier to transition code into distributed environments (think Kubernetes).
- better DevOps (code deployment) and unit/integration testing (see the sketch after this list).
- makes you better at the command line (which makes you better in general).
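To sketch that testing point, assuming the pipeline image has pytest installed and the tests live under tests/ (both assumptions on my part), running the unit tests inside the container could look like …
docker run --rm -v "$(pwd)":/code my-pipeline-image pytest /code/tests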
Getting started with Dockerfiles
Getting started with Dockerfiles is straightforward. The first thing to do is install Docker Desktop; it's easy to install and easy to use.
There are two (probably three) options for writing and using Dockerfiles for data pipelines. First, it's good to understand Docker Hub; it's where pretty much every project under the sun, plus some, stores official Docker images for your use. Need to run Apache Spark? Why install it on your machine when you can get an image with it already installed? Got a Python-based project? Why not just use one of the many official Python images available.
These pre-built images can be pulled with a simple …
docker pull python # or whatever else
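From there you can drop straight into that environment with your local code mounted in (the /code mount path is just an example) …
docker run -it --rm -v "$(pwd)":/code python bash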
The other option is to build your own Dockerfile, based on whatever OS you want, with whatever packages and tools you need … even layered on top of a Dockerfile from someone else.
Let’s take the example of someone who builds pipelines that run in AWS on Linux based images. You want a good development base that is as close as possible to, or exactly like, production, correct? So you build a Dockerfile that has, say, Python and Spark based on Linux, with the aws cli installed.
FROM ubuntu:18.04
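# Install Java, Scala, Python 3.8, pip, and common system/build libraries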
RUN apt-get update && \
apt-get install -y default-jdk scala wget vim software-properties-common python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev libffi-dev
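# Download Apache Spark 3.0.1 (Hadoop 3.2 build) and unpack it to /spark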
RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
mv spark-3.0.1-bin-hadoop3.2/ /spark && \
ln -s /spark spark
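# Install the AWS CLI v2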
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
unzip awscliv2.zip && \
./aws/install
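# Copy the pipeline code into the image and install its Python requirements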
WORKDIR /code
COPY . /code
RUN pip3 install -r requirements.txt
ENV MY_CODE=./code
It’s just an example, but you get the point: defining a complex set of tools that won’t easily be broken, and that all developers and users of the pipeline can use, is a simple and powerful way to make development, testing, and code usage easy for all.
Usually a Dockerfile written like this and stored with the code can be built using a simple command …
docker build --tag my-special-image .
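Once it’s built, anyone can drop into the exact same environment or run the pipeline itself (the entry point script below is hypothetical) …
docker run -it --rm my-special-image bash
docker run --rm my-special-image python3 my_pipeline.py # my_pipeline.py is a hypothetical entry point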
Also, make sure to read up on docker-compose. It's a great way to automate running tests and bits of code.
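As a rough sketch, a minimal docker-compose.yml for this setup (the service name and test command are assumptions) might look like …
version: "3"
services:
  pipeline:
    build: .
    volumes:
      - .:/code
    command: pytest tests/
Then docker-compose up builds the image (if needed) and runs the tests in one go.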
Musings
Dockerfiles are far from rocket science; they are probably one of the easiest things to learn, even as a new developer. Like anything else they can get complicated when running multiple services, but the basic usage of a Dockerfile will give you 80% of what you need up front.
I also believe Dockerfiles in general force a more rigid development structure that is missing from a lot of data engineering codebases. When you find Dockerfiles you are more likely to find unit tests, documentation, requirements files, and generally better design patterns.