How to JOIN datasets in Polars … compared to Pandas.
It’s been a while since I wrote about Polars on this blog; I’ve been remiss. Some time ago I wrote a very simple comparison of switching from Pandas to Polars. I didn’t put much real effort into it, yet it was popular, so this is my attempt at expanding on that topic a little.
Recently, while lying flat on my back on my sunporch, soaking up the vitamin D beating down on me and dreaming about code, which I always do, it struck me.
I’ve used Polars a fair amount lately and even managed to bring it into Production at work … which I was happy to do. But I realized there was one thing I had never done with it, and that is … JOIN two datasets in Polars. I know it sounds like a silly thing, but as Data Engineers we spend a fair amount of time joining datasets, so we might as well see if Polars gets it right or not. I was simply lying in the sun wondering if it’s any easier in Polars.
Let’s just do it and then talk about it. Maybe it will be short and sweet, maybe it won’t.
Comparing dataframe JOINs in Pandas vs Polars.
I think it’s good to do comparisons; they help us appreciate and learn a little better. Let’s start by joining two dataframes with Pandas, that popular Python package that is the bane of all Data Engineers.
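Here’s a minimal sketch of the Pandas MERGE, using tiny made-up dataframes (the model names and capacities are placeholders, not real BackBlaze rows):

```python
import pandas as pd

# Tiny stand-in dataframes; the values are made up for illustration.
drives = pd.DataFrame({
    "model": ["ST4000DM000", "HMS5C4040BLE640", "ST8000NM0055"],
    "failure": [0, 1, 0],
})
capacities = pd.DataFrame({
    "model": ["ST4000DM000", "HMS5C4040BLE640", "ST8000NM0055"],
    "capacity_bytes": [4000787030016, 4000787030016, 8001563222016],
})

# merge() is the classic Pandas way to JOIN two dataframes on a key column.
combined = pd.merge(drives, capacities, on="model", how="inner")
print(combined)
```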
So that was easy enough in Pandas, using the MERGE method. It’s sort of weird, isn’t it? Most people who use Spark are probably used to the JOIN method. Let’s try that in Pandas, the JOIN method that is.
The JOIN method in Pandas can be a little strange if you are not used to it; it defaults to joining on the dataframe INDEX. But otherwise the syntax is more familiar and normal.
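A quick sketch of that index quirk, again with stand-in data (we have to park the key column in the index on the lookup side before join() will cooperate):

```python
import pandas as pd

drives = pd.DataFrame({
    "model": ["ST4000DM000", "ST8000NM0055"],
    "failure": [0, 1],
})

# join() matches against the INDEX of the right dataframe by default,
# so the lookup table carries "model" as its index, not as a column.
capacities = pd.DataFrame(
    {"capacity_bytes": [4000787030016, 8001563222016]},
    index=pd.Index(["ST4000DM000", "ST8000NM0055"], name="model"),
)

# Join the left frame's "model" column against the right frame's index.
combined = drives.join(capacities, on="model")
print(combined)
```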
JOINs with Polars.
Let’s try this with Polars, maybe it will be different, or maybe it won’t.
Pretty much the same, nothing different.
Yes, but is Polars or Pandas faster at doing JOINs?
Well, since the JOIN is no different between Pandas and Polars, and we don’t want to end in an anticlimactic manner, we should probably ask the next best question. Which tool is FASTER at doing JOINs, Polars or Pandas?
To do this in a manner that makes sense, so we can see real results, we are going to use the BackBlaze open-source hard drive dataset, about 9GB worth of data, for our test.
This is what the data looks like.
We are going to need to manufacture a JOIN we can test both Polars and Pandas with. I think what we should do is read the core dataset and make a new one of just the model and capacity_bytes columns, and then pretend we need to join the main dataset to this new dataset to pull in the capacity_bytes field. Get it? I hope so.
I can do this in Polars, after reading the raw dataset, by simply …

```python
subset = df.select('model', 'capacity_bytes').unique()
```
Once that is done, we have the two datasets we join in Polars and get a runtime.
>> print("it took {x}".format(x=t2-t1))
it took 0:09:24.964097
Nine and a half minutes to do the join. In case you were wondering, here is the total row count for some context.
>> combined.select(pl.len()).collect()
shape: (1, 1)
┌────────────┐
│ len        │
│ ---        │
│ u32        │
╞════════════╡
│ 47,862,206 │
└────────────┘
I know with our Polars code it’s hard to tell how much of that time went to the join vs the file IO, but who cares. It got the job done.
Ok, I am really wondering what will happen with Pandas, and if it will even do it at all.
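The Pandas attempt went roughly like this (sketched here with a few tiny generated CSVs in a temp directory standing in for the real 9GB of files, so the paths and row counts are placeholders):

```python
import glob
import os
import tempfile
import pandas as pd

# Write a few tiny CSVs to stand in for the real BackBlaze files.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({
        "model": ["ST4000DM000", "ST8000NM0055"],
        "capacity_bytes": [4000787030016, 8001563222016],
        "failure": [0, i % 2],
    }).to_csv(os.path.join(tmpdir, f"data_{i}.csv"), index=False)

# Read every CSV and glue them into one dataframe. On the real 9GB
# dataset, this is the step that blew up the memory.
frames = [pd.read_csv(f) for f in glob.glob(os.path.join(tmpdir, "*.csv"))]
df = pd.concat(frames, ignore_index=True)

# Build the lookup table and do the merge, same shape as the Polars version.
subset = df[["model", "capacity_bytes"]].drop_duplicates()
combined = df.drop(columns="capacity_bytes").merge(subset, on="model")
print(len(combined))
```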
And the results?
Killed
… of course, it won’t do it. It appears to die while trying to read the multiple CSV files and combine them into a single dataframe.
You know, it’s amazing what a crappy tool Pandas is in the grand scheme of things, and to think that we’ve all been using it for a decade with no option other than jumping straight to Spark to solve these problems. I mean, really? 9GB isn’t that much data, you know.