
Ownership and Borrowing in Rust – Data Engineering Gold Mine.

As I started to use Rust on and off, more out of curiosity than anything, I discovered some specks of gold buried down in the depths. Some of the things I’m going to talk about, well … all of it, is probably fairly obvious to most Rust folk, but it’s enjoyable to learn what new languages have to offer and ingest that knowledge into what we do, in this case, Data Engineering. There are some special things about Rust that can help us all write better data pipelines and transformations.

Just like Scala brought immutability to legions of Data Engineers, Rust is going to bring Ownership and Borrowing through its memory model. Like some ancient King traveling lands throwing handfuls of coins to beleaguered subjects, groveling on the ground for scraps, such is Rust traveling the weary lands of Data Engineering.

Why Rust for Data Engineering?

I’ve written recently about using Rust for Data Engineering, so for a more in-depth discussion of my thoughts on the topic, go peruse that article. I think what it boils down to for me, on the question of “why” Rust for Data Engineering, is that the line between Software Engineering and Data Engineering is starting to blur quickly.

In one sense Data Engineering has never been easier, and everyone and everything has a Python API making the learning curve flat and easy. On the other end of the spectrum, Data Engineering as a culture has started to fully embrace Software Engineering because of the complexity and explosion of data, and the way data took over the world via Machine Learning and AI.

Why Rust for Data Engineering?

  • Performance ( it’s fast! )
  • Low learning curve ( it’s not as bad as you think ).
  • Type Safety.
  • Ownership and Borrowing protect you.
  • Great community and build system ( Cargo and crates ).

Ownership and Borrowing in Rust – for Data Engineers.

I want to be clear about something: you probably shouldn’t listen to someone who’s been piddling around with Rust for a few months when it comes to Rust stuff. That being said, I do want to talk a little bit about Ownership and Borrowing in Rust, and how it can help us as Data Engineers write better code with fewer bugs.

I want to focus on how a lot of Python data pipelines are written because that’s probably the majority, and the issues that can arise with this approach, then write the pipeline in Rust, showing how the concepts of Ownership and Borrowing can make data pipelines easier to reason about, and more importantly, reduce common bugs and gotchas.

Data Pipelines with Python – Easy but with downsides.

For simplicity’s sake, let’s say we are working with some tabular data in CSV files. I will be using the Backblaze open-source hard drive data sets. This is some normal-looking data, and we will apply some normal transformations.

To make this easy, let’s just pretend we have to do some separate transformations to change the data types of our CSV dataset.

Or maybe if someone is feeling spicy it might be this.

The problem with this sort of approach to data pipelines with Python is that usually, the transformations are much longer, more complex, and in-depth. This leads to several problems in large complex data pipelines.

  • It’s easy to mis-reference variables/objects as they get passed around, and unit tests often won’t catch it.
  • It’s hard to reason about and debug, because anything can change and be mutated, overwritten, or whatever, at any point in time.

I’m fairly certain even the most seasoned Python engineer can relate to this. This is where Rust can help us. Ownership and Borrowing.

Yet a more Excellent Way.

Ownership and Borrowing in Rust.

This is where Rust can come in to save the day for Data Engineers and complex data pipelines. Do you want something that will make it easy to debug, troubleshoot, and understand code bases and pipelines? Do you want to have to think harder about what you are assigning, passing around, and using (objects and variables)? If so, Rust with its model of Ownership and Borrowing is here to save you.

What is Ownership and Borrowing in Rust?

“Each value in Rust has an owner.”

“The variable is valid from the point at which it’s declared until the end of the current scope.”

“… represent references, and they allow you to refer to some value without taking ownership of it.”

Rust docs.
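Here is a minimal sketch of those three ideas using only the standard library. The serial numbers, the model name, and the first_serial helper are made up for illustration; none of it comes from the Rust docs or from a real pipeline.

```rust
// Hypothetical helper: borrows a list of drive serial numbers and returns a
// reference to the first one. Nothing is copied and ownership does not change.
fn first_serial(serials: &[String]) -> Option<&String> {
    serials.first()
}

fn main() {
    // Each value has an owner: `serials` owns this Vec (made-up sample values).
    let serials = vec![String::from("ZA10"), String::from("ZA11")];

    {
        // `model` is valid from its declaration until the end of this block.
        let model = String::from("ST4000");
        println!("{} drives of model {}", serials.len(), model);
    } // `model` goes out of scope here and its memory is released.

    // The ampersand creates a reference, so `serials` is only borrowed.
    if let Some(first) = first_serial(&serials) {
        println!("first serial: {first}");
    }

    // `serials` is still the owner and is still usable after the borrow.
    println!("total serials: {}", serials.len());
}
```

Because first_serial only borrows, the caller never gives anything up, and that is visible right in the function signature.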

This ownership and referencing/borrowing black magic is why people talk about Rust being “memory-safe.” Rust makes it hard to do things wrong; the compiler is smart and will complain. Rust makes you think about each variable and object in a new way, compared to Python that is. You can’t just willy-nilly change objects, pass them around, and do whatever you want to them without thinking. Rust MAKES you think and write exactly what you mean to do, whatever that may be.
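To make the “no willy-nilly changes” point concrete, here is a small sketch. The Table struct and its three functions are hypothetical stand-ins, not a real DataFrame API. The thing to notice is that reading, mutating, and handing a value away each look different in the function signature and at the call site.

```rust
// A hypothetical stand-in for a table of data; not a real DataFrame type.
#[derive(Debug)]
struct Table {
    rows: Vec<u64>,
}

// Borrows: the caller keeps ownership and can keep using the value.
fn row_count(table: &Table) -> usize {
    table.rows.len()
}

// Mutably borrows: the change has to be requested with `&mut`.
fn drop_zeros(table: &mut Table) {
    table.rows.retain(|&r| r != 0);
}

// Takes ownership: the caller can no longer use the value it passed in.
fn consume(table: Table) -> usize {
    table.rows.len()
}

fn main() {
    // `mut` is required here, otherwise `drop_zeros` below will not compile.
    let mut table = Table { rows: vec![3, 0, 7] };

    let before = row_count(&table); // shared borrow, `table` is still ours
    drop_zeros(&mut table); // mutation is explicit at the call site
    let after = row_count(&table);
    println!("{before} rows before, {after} rows after");

    let total = consume(table); // ownership moves into `consume`
    println!("consumed a table with {total} rows");

    // println!("{:?}", table); // would not compile: error[E0382], value used after move
}
```

In Python all three calls would look the same, and you would have to read each function body to know which one changes your data.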

What do I mean? Let’s take a look. We will write the above data pipeline example in Rust, and break some stuff.

Data Pipelines with Rust and DataFusion.

So what does this stuff have to do with writing data pipelines in Rust? How does it make a pipeline better, or an engineer for that matter? Well, honestly, it just makes you think more about what you’re doing. As someone who uses Python all day long, every day, it has its good and bad parts.

I can write volumes and reams of code in short order like it’s nothing. Sure, I unit test, and follow best practices, but you pay a price for ease. The life of the rich and famous comes with a price. In the case of Python, it’s mistakes, because we are all human after all.

Rust changes that.

“You have to think more clearly and be more explicit about exactly what is happening and why when writing data pipelines in Rust.”

Let’s take a single, very simple example of what this looks like in real life, and how Rust data pipelines are different. I will write a few simple lines of Rust using DataFusion. Read some CSV data, do some transforms, and you get the idea.

Here we go, here is the main function. Read some CSV file into a Dataframe, then do two things with it. See? Make a Dataframe called base and then pull_metrics and also pull_some_other_metrics.

What happens when I try to cargo build this code? Rust doesn’t let me.

Here is the logic behind those functions, not that it matters much.

Anyway, what happens when I try to build this code? Nothing good.

It says … error[E0382]: use of moved value: `base`. 

What’s happening? Rust is trying to protect us, and the compiler is so smart that it tells you specifically what is wrong.

---- move occurs because `base` has type `impl Future<Output = Arc<datafusion::dataframe::DataFrame>>`, which does not implement the `Copy` trait

12  |   let df = pull_metrics(base.await);

    |                             ------ `base` moved due to this method call

 

It’s saying we used base here in the call to pull_metrics: base.await consumed base, so its Ownership changed. And then later on, with pull_some_other_metrics, we try to use it again.

let df = pull_some_other_metrics(base.await);

    |                                   ^^^^ value used here after move
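One way to keep the compiler happy is to resolve the value once, bind it, and then either lend it out or hand out cheap shared handles. The sketch below shows that shape with standard-library types only. Frame, load_base, and the bodies of both functions are hypothetical stand-ins, not DataFusion code and not the original pipeline. I kept Arc because the error message shows the frame living inside one.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a DataFrame; the real pipeline used DataFusion.
struct Frame {
    failures: Vec<u32>,
}

// Stand-in for loading the CSV once (made-up values).
fn load_base() -> Arc<Frame> {
    Arc::new(Frame { failures: vec![0, 1, 0, 1, 1] })
}

// Option 1: borrow the frame, so the caller keeps ownership.
fn pull_metrics(base: &Frame) -> u32 {
    base.failures.iter().sum()
}

// Option 2: take a shared handle by value; callers pass an Arc clone.
fn pull_some_other_metrics(base: Arc<Frame>) -> usize {
    base.failures.len()
}

fn main() {
    // Bind the loaded value once instead of consuming it at each call.
    let base = load_base();

    let failed = pull_metrics(&base); // borrowed, not moved
    let total = pull_some_other_metrics(Arc::clone(&base)); // bumps a reference count, copies no rows
    let again = pull_metrics(&base); // `base` is still usable here

    println!("{failed} failures out of {total} drives (rechecked: {again})");
}
```

Which option fits depends on whether the function needs to hold on to the data after it returns. Borrowing is the lighter choice when it does not.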

What’s the point?

The point is that Rust is making you think, it’s protecting you … from you. In much larger code bases and data pipelines, with say many Dataframes of various sorts, with a multitude of transformations here and there, things can go wrong very quickly and quietly.

How many times in Python have I passed some object I thought was one thing, but it was another, or maybe it was one thing at one point, but became something else without me catching on? Sure, you say, just write better Python. Well … I’ve been trying to write better Python for a decade+. The problem with me is that I’m me.

Conclusion.

What I wanted to convey with this simple and cursory overview of Ownership and Borrowing in Rust was to simply make you think. Think of all the bugs and problems it can solve in everyday data pipelines being written all around this big old world. Rust isn’t as hard to write as you think, sure, it’s harder than Python, but there is a payback.

The payback is that Rust will help you think more about what you’re doing … being intentional. Sure, the type system will make you think more, but beyond that, the Ownership and Borrowing memory model of Rust will absolutely trip you up in the beginning … in a good way.

It will make you say “What am I doing with this object?” “Who has control of this object?” “Can I borrow this object?” These types of questions reduce the bugs we introduce into our data pipelines and make you think more. Thinking is good.