Datafusion SQL CLI – Look Ma, I made a new ETL tool.
Sometimes I just need something new and interesting to work on, to keep me engaged. A few days ago I was lying by the river next to a fire, cold air blowing on my face and eagles soaring above. While contemplating life and data engineering … something flitted across my mind, just a little fragment of an idea someone had written about.
The little fragment had to do with Datafusion, a Rust-based query engine, and something about it having a SQL CLI interface.
What an interesting thing. I’ve used Datafusion a few times here and there, and I love Rust because it’s fast. I’m a Data Engineer, so I’m eternally enslaved to SQL whether I like it or not. This whole thing just seemed like an interesting little tidbit to poke at.
It basically made me wonder if I could combine the Datafusion SQL CLI with bash into a new ETL tool. Simple, small, fast, and maybe fun? Just because I can?
Strange Ideas and Learning
If you’re anything like me you need a reason to learn something new, or you get bored and like to do strange things no one else will. Probably one of my greatest joys in life is poking and prodding at new tools, finding uses for them, doing things, and getting things done others won’t.
I feel like this is one of those perfect ideas for that sort of work.
I mean, really, who needs a SQL CLI tool??!!
In the age of the Databricks Notebook and Polars, I’m just not sure a CLI tool for SQL makes a ton of sense on the surface. At least for the average Data Engineer going about the daily humdrum. But then I thought about it more.
What are CLI tools good for? Building other tools for one.
Enter my idea to build some ETL with bash plus this Datafusion SQL CLI tool.
Making it work.
This is what I have in mind, who knows if it is something we can pull off or not, but it does seem awfully attractive.
I mean, think about it: if we had to lick a finger and hold it to the wind, like the old-timers, I would venture a guess that a good 80% of data pipelines and ETL are running SQL of some sort or another. Who knows what people are really doing, but again I would guess they are …
- Ingesting data into a database, running SQL, and pumping some data back out at some point.
- And if not a database, maybe SparkSQL, or something else that has a SQL API using data lakes or lake houses.
I’m here to simplify the architecture a little. Over the years I’ve learned that we are all creatures of habit and simply do what we know and what is easy.
We don’t think out of the box very well.
I have no idea if this is going to work, but here we go.
Building my new ETL tool.
First things first, let’s build a Dockerfile that will hold what we need to get this done.
Also, we will need some data; let’s go with the Divvy Bike trips open data files on S3. Here is a shot of what that data looks like.
I pre-downloaded a file and made sure it was copied in via the Dockerfile. Here is the Dockerfile.
So we aren’t doing much here … an Ubuntu Dockerfile with Rust and cargo installed, of course copying over local files and making sure our bash file is executable.
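A minimal sketch of that kind of Dockerfile — the file names (etl.sh, divvy_trips.csv) are my own placeholders, and datafusion-cli is installed via cargo:

```dockerfile
# Hypothetical sketch -- file names are illustrative
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y curl build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Rust + cargo, then the datafusion-cli binary
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
RUN cargo install datafusion-cli

WORKDIR /app
# Copy the pre-downloaded Divvy trips file and our wrapper script
COPY divvy_trips.csv etl.sh ./
RUN chmod +x etl.sh
```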
Bash + Datafusion CLI
Next, we need to write our little Bash script that will encapsulate our datafusion SQL CLI command and hopefully capture the output to a file. I mean, what is ETL if not that?
- read something
- transform or do analytics
- save output
If you are unfamiliar with Datafusion, I wrote about it before, go read. It’s a DataFrame library written in Rust, more or less, and it comes with a nice CLI tool that can be passed SQL and will do its thing. It seems mostly useful for interactive data exploration, or for making tools like we are doing.
I don’t think the Bash file will be complicated. Really just a few things …
- gather input variables
- build the datafusion command
- capture output to file
I mean, come on, this is just for fun, not to be fancy.
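A rough sketch of what such a wrapper might look like — the file names and variables are my own inventions, not the original script:

```shell
#!/usr/bin/env bash
# etl.sh -- hypothetical wrapper around datafusion-cli (names are illustrative)
set -euo pipefail

# 1. gather input variables (with defaults so it runs standalone)
SQL_FILE="${1:-analytics.sql}"
OUT_FILE="${2:-results.csv}"

echo "Running query from $SQL_FILE ..."   # these echoes come back to haunt us below

# 2. build the datafusion command, 3. capture output to file
if command -v datafusion-cli >/dev/null 2>&1; then
  datafusion-cli -f "$SQL_FILE" > "$OUT_FILE"
fi

echo "Wrote results to $OUT_FILE"
```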
That should work.
Now, let’s run the Bash script and have a SQL query run against this file, and save the results.
The query I want to run is some simple analytics; Datafusion makes it easy, just referencing the data file I’m interested in.
So, nothing for it but to give it a try … here is the command I ran inside the Docker container.
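A sketch of that sort of invocation, with the query inline — the column names here are guesses at the Divvy schema, not the original query:

```shell
# Hypothetical analytics query; column names are guesses at the Divvy schema.
SQL="SELECT rideable_type, COUNT(*) AS trip_cnt
FROM 'divvy_trips.csv'
GROUP BY rideable_type
ORDER BY trip_cnt DESC"

# datafusion-cli can query a CSV file directly by its path:
if command -v datafusion-cli >/dev/null 2>&1; then
  datafusion-cli -c "$SQL"
else
  echo "datafusion-cli not installed; skipping"
fi
```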
Of course, we see the output; the problem is we are probably going to have extra text in the file.
Yup, that doesn’t work like we want. Let’s do two things …
- Remove my silly echo messages
- Add the --quiet flag to our datafusion-cli command
Let’s see if that gets rid of the junk.
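With those two changes, the wrapper boils down to something like this (again, file names are my own placeholders):

```shell
# Same wrapper, minus the echo chatter, plus --quiet on the CLI itself.
SQL_FILE="analytics.sql"
OUT_FILE="results.csv"

if command -v datafusion-cli >/dev/null 2>&1; then
  # --quiet keeps the banner and timing lines out of the captured results
  datafusion-cli --quiet -f "$SQL_FILE" > "$OUT_FILE"
fi
```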
It worked! Nice!
We now have a nice flat file with our analytics.
Reality
What does this all mean? You’re probably just pooh-poohing me with a who cares and all. Sure, but it’s always fun to do something new and try a new tool like Datafusion’s SQL CLI.
It’s important to do little fun projects like this for a few reasons.
- The datafusion SQL CLI is a great tool that could be built into other tools for Data Engineers.
  - Think about all the possibilities for custom tools that use SQL (that should get you excited).
- Everyone uses SQL all the time in many pipelines.
- Typically we use heavy solutions to enable the ability to run SQL in a reliable and fast manner.
  - Rust is fast and reliable.
  - This tool is lightweight.
Also, don’t forget, we have barely scratched the surface of the features of the datafusion SQL CLI; it can do cloud buckets like AWS S3, directories of parquet files, etc. So it’s a serious tool that can do serious things.
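For instance, datafusion-cli supports registering external tables over object storage. A sketch — the bucket and path are made up, and real runs read AWS credentials from the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION):

```shell
# Hypothetical: query a directory of parquet files sitting in S3.
# The bucket/path are made up; creds come from environment variables.
SQL=$(cat <<'EOF'
CREATE EXTERNAL TABLE trips
STORED AS PARQUET
LOCATION 's3://my-bucket/divvy/trips/';

SELECT COUNT(*) FROM trips;
EOF
)

if command -v datafusion-cli >/dev/null 2>&1; then
  datafusion-cli --quiet -c "$SQL"
fi
```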
Are you looking to save costs and process files easily with some ETL? Here is an option.
The other lesson is for younger data engineers who might want to learn and grow. The only way to do this is to write code and do things that make you stretch. They don’t have to be great or huge projects. Like this one, they can be small and fun. I got to use bash, Docker, SQL, etc. To learn and grow, you must do.