Introduction to dbt … for Data Engineers.
So, you’ve heard about dbt, have you? I honestly can’t decide if it’s here to stay or not, probably is, enough folks are using it, and preaching about it. I personally have always been a little skeptical of dbt, not because it can’t do what it says it can do, it can, but because I’m old and bitter from my many years of Data Engineering, and I always see the problems in things.
But, I will let you judge that for yourself. Today I want to give a brief overview of dbt, kick the tires, muse about its features, and most importantly, look at dbt from a Data Engineering perspective and ferret out the good, the bad, and the ugly. I will try my best to be nice, but don’t count on it. Code is on GitHub.
What is dbt anyways?
I’m just going to pretend you’ve been living under a rock, or a desk, put there by your angry boss, unable to see the light of day, and therefore have no idea what dbt is, and what it’s used for. dbt has coined itself as the “transformation tool” and “the analytics engineering workflow,” which I find a little strange. Just call it what it is … another ETL tool. I know, I know, it’s more than that, testing, documentation, etc. I get it. But it is mostly a data transformation tool.
What isn’t dbt? Well, it’s not an extract or load tool, it doesn’t do that, just transformation. This does bring up the old grump in me, and I have to complain about it.
As Data Engineers building complex pipelines and data architecture, we should always keep an eye on complexity when adding tools to the stack … they must give back more than they take. Just adding tools on top of tools inevitably creates issues. It should make you pause that you will have to use a different tool to do your data extraction and loading … and that same tool most likely already has transformation capabilities! Fine, fine, I know you want all the other benefits of dbt, but I’m just saying. Even if dbt integrated with that tool, it’s just another abstraction and layer on top of what is probably already a complex system.
My take on dbt.
What is dbt to me? It’s heaven for those heavily SQL-based teams that rely on it for all their transformations, especially those working on Data Warehouse and Data Lake teams. Anyone using Postgres, MySQL, SparkSQL, Snowflake, Redshift and the like, those SQL-focused teams who struggle with ETL, testing, complex SQL, and warehousing … of course dbt will be the light shining in the darkness for you.
Diving into dbt.
So how do we even start learning dbt? Let’s just start by listing some of the main ideas and concepts of dbt, and casually mention how we might implement those features in real life. Then, once we are done doing that, let’s try a simple dbt project on Postgres. So what are the concepts to be aware of with dbt?
- “A dbt project is a directory of .sql and .yml files.” – dbt docs
- “A model is a single .sql file.” – dbt docs
- “Each model contains a single select statement.” – dbt docs
- A project (.yml) file which configures and defines your dbt project.
- “Tests are assertions you make about your models and other resources in your dbt project.” – dbt docs (see the sketch just after this list)
- “Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables.” – dbt docs
- “Seeds are CSV files in your dbt project.” – dbt docs
- “Models have materializations (build strategies for your select query).” – dbt docs
  - view
  - table
  - ephemeral
  - incremental
- The dbt CLI makes it easy to do dbt things at the command line.
- dbt does documentation well.
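To make the first few of those concepts concrete, here is a minimal sketch of a model and a test. The file names are my own invention, and this is not part of the project we build below.

-- models/staging/stg_trips.sql (hypothetical): a model is just one SELECT statement
select
    ride_id,
    rideable_type,
    started_at,
    ended_at
from trips

# models/staging/schema.yml (hypothetical): tests are assertions about that model
version: 2
models:
  - name: stg_trips
    columns:
      - name: ride_id
        tests:
          - unique
          - not_null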
I think we pretty much have a good idea of what to expect from dbt with that list.
Example dbt project with Postgres.
What better way to learn dbt than to do a little project? What easier technology to use than Postgres? First things first, let’s set up a Docker and docker-compose setup that includes Postgres and dbt, with some prepopulated schema and data that we can use for a sample dbt project.
Ok, we got some grunt work to do, so let’s get to it. First, let’s build our Dockerfile, which won’t be much.
FROM python:3
WORKDIR app
COPY . /app
RUN pip3 install dbt-postgres==1.2.2
Well, there’s that. Now, let’s get our docker-compose going, which will be a little more involved, as we require Postgres for our example.
version: "3.9"
services:
postgres:
image: postgres:10.5
restart: always
environment:
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
ports:
- '5432:5432'
volumes:
- ./postgres-data:/var/lib/postgresql/data
- ./DDL/create_tables.sql:/docker-entrypoint-initdb.d/create_tables.sql
healthcheck:
test: [ "CMD-SHELL", "pg_isready -U postgres" ]
interval: 5s
timeout: 5s
retries: 5
run:
environment:
- postgres_host=postgres
- postgres_database=postgres
- postgres_user=postgres
- postgres_password=postgres
image: "dbt-example"
depends_on:
postgres:
condition: service_healthy
volumes:
- .:/app
command: dbt --profiles-dir ./.dbt debug
A few things worthy of note are needed in our docker-compose file to help with our example dbt project. First, note under volumes that we are calling a DDL script to run when Postgres spins up.
volumes:
  - ./postgres-data:/var/lib/postgresql/data
  - ./DDL/create_tables.sql:/docker-entrypoint-initdb.d/create_tables.sql
Here is that DDL, which includes a few values to get some real data into our Postgres table. We are using the open-source Divvy bike trip data. We have a raw trips table, plus a statement to load data into that raw table when Postgres starts up. This will allow us to write a simple dbt model that moves data from its raw state into a fact model.
I mean, this is the whole point, after all.
CREATE TABLE IF NOT EXISTS trips
(
ride_id TEXT PRIMARY KEY,
rideable_type TEXT,
started_at timestamp without time zone,
ended_at timestamp without time zone,
start_station_name TEXT NULL,
start_station_id INTEGER NULL,
end_station_name TEXT NULL,
end_station_id INTEGER NULL,
start_lat NUMERIC(4,2),
start_lng NUMERIC(4,2),
end_lat NUMERIC(4,2),
end_lng NUMERIC(4,2),
member_casual TEXT NOT NULL
);
INSERT INTO trips VALUES
('85522531742CB311','electric_bike','2022-08-26 23:21:28','2022-08-26 23:48:03',NULL,NULL,NULL,NULL,41.95,-87.66,42.03,-87.67,'casual'),
('D789EFF6D43AB2A1','electric_bike','2022-08-26 19:25:46','2022-08-26 19:37:02',NULL,NULL,'Western Ave & Roscoe St','15634',41.93,-87.69,41.943034,-87.687288,'casual'),
('D21A8992F5A8CDA3','electric_bike','2022-08-26 19:14:07','2022-08-26 19:19:58',NULL,NULL,NULL,NULL,41.93,-87.7,41.92,-87.7,'casual');
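As an aside, since seeds came up in the concept list earlier: a few sample rows like these could also live as a CSV inside the dbt project and be loaded with dbt seed, instead of an INSERT in the init script. A rough sketch, with a hypothetical file name and only a subset of the columns (dbt seed would build a trips_seed table from it):

# seeds/trips_seed.csv (hypothetical)
ride_id,rideable_type,started_at,ended_at,member_casual
85522531742CB311,electric_bike,2022-08-26 23:21:28,2022-08-26 23:48:03,casual
D789EFF6D43AB2A1,electric_bike,2022-08-26 19:25:46,2022-08-26 19:37:02,casual

We will stick with the plain INSERT here, mostly to keep the Postgres side of the compose setup self-contained.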
Let’s get to the dbt part.
Now that we have a framework set up that will allow us to do an example dbt project, let’s get to the dbt work. Apparently, the first thing we will need to do is add a profile that can connect to our Postgres warehouse. You should use environment variables to hold your credentials; in our profiles.yml we will just be using the default Postgres creds.
trips:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: postgres
      threads: 2
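If you do want the credentials out of the file itself, dbt can pull them from environment variables with its env_var Jinja function, and our docker-compose already passes postgres_host, postgres_user, and friends into the run container. A sketch of what the connection block could look like in that case:

dev:
  type: postgres
  host: "{{ env_var('postgres_host') }}"
  user: "{{ env_var('postgres_user') }}"
  password: "{{ env_var('postgres_password') }}"
  port: 5432
  dbname: "{{ env_var('postgres_database') }}"
  schema: postgres
  threads: 2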
Let’s also add a super basic dbt_project.yml file. Note, I added the model I’m going to create later with a materialization of table. Aka, I have raw data I want to transform into a fact table.
name: trips
config-version: 2
version: 1.0
profile: trips

models:
  trips:
    raw_trips:
      # materialize the raw_trips model as a table
      +materialized: table
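For reference, the little project we are assembling ends up looking roughly like this on disk; the exact location of the model file under models/ is my choice, dbt only cares that it lives under the configured model paths:

dbtSampleProject/
├── Dockerfile
├── docker-compose.yml
├── DDL/
│   └── create_tables.sql
├── .dbt/
│   └── profiles.yml        # referenced via --profiles-dir ./.dbt
├── dbt_project.yml
└── models/
    └── raw_trips.sql       # the model we write below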
Let’s go ahead and build our Dockerfile so we can test this connection to our Postgres warehouse.
docker build . --tag=dbt-example
The following is the command we will put in the docker-compose file as what we want to run.
dbt --profiles-dir ./.dbt debug
Now we can run our connection test. Per the output below, it looks like all is well; our Docker container with dbt can talk to our Postgres warehouse.
>> docker-compose up run
Attaching to dbtsampleproject-run-1
dbtsampleproject-run-1 | 23:40:07 Running with dbt=1.2.2
dbtsampleproject-run-1 | dbt version: 1.2.2
dbtsampleproject-run-1 | python version: 3.10.7
dbtsampleproject-run-1 | python path: /usr/local/bin/python
dbtsampleproject-run-1 | os info: Linux-5.10.104-linuxkit-aarch64-with-glibc2.31
dbtsampleproject-run-1 | Using profiles.yml file at /app/.dbt/profiles.yml
dbtsampleproject-run-1 | Using dbt_project.yml file at /app/dbt_project.yml
dbtsampleproject-run-1 |
dbtsampleproject-run-1 | Configuration:
dbtsampleproject-run-1 | profiles.yml file [OK found and valid]
dbtsampleproject-run-1 | dbt_project.yml file [OK found and valid]
dbtsampleproject-run-1 |
dbtsampleproject-run-1 | Required dependencies:
dbtsampleproject-run-1 | - git [OK found]
dbtsampleproject-run-1 |
dbtsampleproject-run-1 | Connection:
dbtsampleproject-run-1 | host: postgres
dbtsampleproject-run-1 | port: 5432
dbtsampleproject-run-1 | user: postgres
dbtsampleproject-run-1 | database: postgres
dbtsampleproject-run-1 | schema: postgres
dbtsampleproject-run-1 | search_path: None
dbtsampleproject-run-1 | keepalives_idle: 0
dbtsampleproject-run-1 | sslmode: None
dbtsampleproject-run-1 | Connection test: [OK connection ok]
dbtsampleproject-run-1 |
dbtsampleproject-run-1 | All checks passed!
Writing our first dbt model.
Ok, after all that hard work I think a nap is in order. But alas, in the words of Gandalf, “The world is not in your books and maps, it’s out there.” So we must carry on. The Model in dbt is apparently where all the magic happens. You can read more about models here, but they are essentially just SELECT statements that make up the different transforms we want to run.
The SQL can be templated and the model can have different materializations, as we mentioned before. But let’s keep it simple for our example. We just want a model that takes our raw table data, does some slight transformations, and produces a fact model that can be materialized as a table.
SELECT
trips.ride_id,
trips.rideable_type,
trips.started_at,
trips.ended_at,
EXTRACT(EPOCH FROM (trips.started_at - trips.ended_at)) as trip_duration,
CASE WHEN trips.start_station_name IS NULL
THEN 'Unknown' ELSE trips.start_station_name
END as start_station_name,
trips.start_station_id,
CASE WHEN trips.end_station_name IS NULL
THEN 'Unknown' ELSE trips.end_station_name
END as end_station_name,
trips.end_station_id,
trips.start_lat,
trips.start_lng,
trips.end_lat,
trips.end_lng,
CASE WHEN trips.member_casual = 'member' THEN 1 ELSE 0 END as member_casual
FROM trips
Here is our very boring dbt model that doesn’t do much of anything besides some CASE statements. But, since in our project configuration we set the materialization of this model to table, we should now have a raw_trips table that exists with our transformed data.
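As an aside on the templating mentioned above, the same model could set its own materialization and reference the raw table through dbt’s Jinja helpers rather than hard-coding it. A sketch, assuming we had declared the trips table as a dbt source named public in a schema.yml (which we have not done in this project):

-- hypothetical templated version of the same model
{{ config(materialized='table') }}

SELECT
    ride_id,
    rideable_type,
    started_at,
    ended_at
FROM {{ source('public', 'trips') }}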
By the way, I changed the docker-compose command to the following line to run the actual model.
command: dbt --profiles-dir ./.dbt run
It all looks like it ran ok.
(venvrm) (base) danielbeach@Daniels-MacBook-Pro dbtSampleProject % docker-compose up run
[+] Running 2/0
⠿ Container dbtsampleproject-postgres-warehouse-1 Running 0.0s
⠿ Container dbtsampleproject-run-1 Created 0.0s
Attaching to dbtsampleproject-run-1
dbtsampleproject-run-1 | 13:39:59 Running with dbt=1.2.2
dbtsampleproject-run-1 | 13:39:59 Found 1 model, 0 tests, 0 snapshots, 0 analyses, 256 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
dbtsampleproject-run-1 | 13:39:59
dbtsampleproject-run-1 | 13:39:59 Concurrency: 2 threads (target='dev')
dbtsampleproject-run-1 | 13:39:59
dbtsampleproject-run-1 | 13:40:00 1 of 1 START table model postgres.raw_trips .................................... [RUN]
dbtsampleproject-run-1 | 13:40:00 1 of 1 OK created table model postgres.raw_trips ............................... [SELECT 3 in 0.27s]
dbtsampleproject-run-1 | 13:40:00
dbtsampleproject-run-1 | 13:40:00 Finished running 1 table model in 0 hours 0 minutes and 0.59 seconds (0.59s).
dbtsampleproject-run-1 | 13:40:00
dbtsampleproject-run-1 | 13:40:00 Completed successfully
dbtsampleproject-run-1 | 13:40:00
dbtsampleproject-run-1 | 13:40:00 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dbtsampleproject-run-1 exited with code 0
Let’s jump inside our Postgres instance and see if that table exists with our transformed data.
psql -h 0.0.0.0 -U postgres
psql (14.1, server 10.5 (Debian 10.5-2.pgdg90+1))
Type "help" for help.
postgres=# SELECT * FROM raw_trips;
ride_id | rideable_type | started_at | ended_at | trip_duration | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual
------------------+---------------+---------------------+---------------------+---------------+--------------------+------------------+-------------------------+----------------+-----------+-----------+---------+---------+---------------
85522531742CB311 | electric_bike | 2022-08-26 23:21:28 | 2022-08-26 23:48:03 | -1595 | Unknown | | Unknown | | 41.95 | -87.66 | 42.03 | -87.67 | 0
D789EFF6D43AB2A1 | electric_bike | 2022-08-26 19:25:46 | 2022-08-26 19:37:02 | -676 | Unknown | | Western Ave & Roscoe St | 15634 | 41.93 | -87.69 | 41.94 | -87.69 | 0
D21A8992F5A8CDA3 | electric_bike | 2022-08-26 19:14:07 | 2022-08-26 19:19:58 | -351 | Unknown | | Unknown | | 41.93 | -87.70 | 41.92 | -87.70 | 0
(3 rows)
(END)
Well, look at that, it worked. We can see the Unknown values in start_station_name for the rows that were NULL, and of course the model was materialized as a table, as we configured.
Musings on dbt for Data Engineers.
Short and to the point, that one. I think that’s enough for most folks to get at least a basic understanding of dbt and generally what it does and provides. Of course, it only scratches the surface and doesn’t explore things like tests and documentation. That being said, I do have a few comments and thoughts about dbt as a tool for Data Engineers whose pipelines include lots of SQL.
- I would not switch my pipeline to all SQL just to use dbt.
- If my pipelines included at least 30% or more SQL, I would use dbt.
- The configuration and setup/use of dbt is extremely straightforward and easy to understand.
- The concepts around dbt projects are very approachable.
- The testing and docs abilities of dbt are the icing on the cake.
- dbt and its models don’t solve the nasty spaghetti SQL code that is the doom of many SQL-based data teams.
There is much to love about dbt, and much to hate. I love what dbt does for those teams that are mostly SQL based; it gives them a framework, testability, and generally a better approach to managing complex SQL pipelines. It’s better than stored procedures and ad-hoc SQL strung together in a never-ending trail of misery. But, I can see that it would not necessarily stop data teams from going down that path. You would have to be rigorous and die-hard in your approach to the models and their complexity.
I can also see where the cost of dbt, the way it wants you to develop, could easily spiral out of control. If you’re writing small models all the way down the line in your pipelines, and someone is materializing those to tables … your storage and compute costs are probably going to skyrocket. Reading and writing all that data … ugh. But, do the costs outweigh the benefits? Probably for some yes, for some no. It is interesting that a lot of complaints about dbt costs have been surfacing lately; who knows if that is simply folks not implementing the technology correctly (usually the case), or if there is some truth to it all … is it simply a more expensive way to do things? Time will tell.
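Part of the answer, for what it’s worth, is being deliberate about materializations: defaulting intermediate models to views or ephemeral models, and only paying the table cost at the end of the line. A sketch of that kind of config in dbt_project.yml, with folder names that are purely hypothetical:

models:
  trips:
    staging:
      +materialized: ephemeral   # compiled into downstream queries, nothing stored
    marts:
      +materialized: table       # only the final models pay the storage and write cost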