Date and DateTime Manipulation in Polars
One thing all Data Engineers are doomed to do in purgatory will be to solve different date
and datetime
problems in an endless loop. I’m sure of it. I can’t imagine anything worse, so that must be it. Either way the constant need to manipulate date
s and datetime
s are just a way of life, something that never ends and never changes. Also, it appears Polars is here to stay from what I can tell. Not a fad like that Data Mesh. Since Polars is here to stay, (I’ve already got it running in production at my company, (don’t mind if I bow)), we should probably take a gander at how to manipulate date
and datetime
objects from both the Dataframe and (if I have time) SQL perspective. See if we can find anything to complain about. I like to complain.
Polars dates and datetimes
You know, before I dig into the documentation and start writing code, to be fair, I should just come up with a list of things that normally take place with date
and datetime
objects, and then just make sure Polars can do those things in a reasonable manner. Rather than starting backward and starting with the documentation. I don’t want to cloud my mind. I want just to have a set of things I normally do with dates, and then see how easy or not it is to do those things in Polars. That is the best and fairest test probably.
Well, here goes nothing. All code on GitHub.
- convert a
string
to adate
- convert a
string
to adatetime
- convert a
date
ordatetime
back to astring
- pull out
year
,month
, andday
from adate
ordatetime
object. - pull the
quarter
out of adate
ordatetime
object. - get the
diff
between twodate
ordatetime
objects. - Add an arbitrary number of
days
to adate
ordatetime
object.
That’s basic enough and enough for us probably, at least to find out if Polars
is coming up to snuff for basic manipulations of date
or datetime
objects.
Polars Python code for date
and datetime
manipulation.
Who knows, maybe this will be boring, but there is no time like the present, let’s just dive into each one of the items above. I will attempt to solve each problem with both the Dataframe API and (if I have time) SQLContext.
convert a string
to a date
This isn’t too bad with the Dataframe API. Basically, if we have a Dataframe with a String
column, we can just .str
and then .to_datetime()
on that. By default, it tries to figure out the format itself, which is straightforward if it’s normal text.
Otherwise, you can pass an format
if need be and pass the normal chrono
type formats like `%Y-%m-%d %H:%M:%S` that we are all used to.
I’m curious about the strange .str.to_datetime()
requirements in Python, as someone who writes Rust
, it feels very … well … Rusty
to me. Not Pythonic at all. Makes no sense to me. It should be simply pl.col('column').to_datetime()
and error if the data types are wrong or something.
I get it though, and understand why they do that.
Before I go any further, I’m curious what happens if I simply. just .cast('date')
on the string
column. Will that shortcircuit the extra call to .str.to_datetime()
?
Yeah … no, that was too much wishful thing there. Although it did give an interesting message … consider using `strptime` well … don’t mind if I do there Sunny Jim. Looks like it works. I’ll be buggered if I know the difference between to_datetime()
and strptime()
… aka why you would use one over the other.
Well, we pounded that dead horse a little much. Let’s try a string
to date
with SQL
in Polars and see what happens.
I did some hunting trying to find what kind of SQL
is supported by Polars. Apparently not much.
“Polars does not support the full SQL language” – docs
I had heard rumblings on Linkedin some time ago that the new release of Polars included a DATE()
method in SQL. So let’s give it a try. I could not find a reference to it in any of the documentation that I looked at. Buggers. By George, it worked! Three cheers for Polars SQL!
convert a string
to a datetime
I was going to spend some time here, but I think I will not. I will show a simple example and move on. Since the string
to date
worked, we can assume based on one example that string
to datetime
probably works well. This is accomplished with .str.to_datetime()
.
Let’s move on.
convert a date
or datetime
back to a string
One of the other common things I find myself doing is the opposite of what we’ve just done, taking some date
or datetime
and give me a string
back. Comes in handy. This can be done with .dt.strftime()
pull out year
, month
, and day
from a date
or datetime
object.
Well, yet another simple but important date
buggery we all do quite often is pull year
, month
, or day
from a date
or datetime
object. Being able to do this is useful in all sorts of situations.
Very handy indeed. Yes, there is every support for things like quarter
with dt.quarter()
, other stuff as well, like minute()
, weekday()
, month_start()
, month_end()
, and other little tidbits. A good sign indeed for Polars.
get the diff
between two date
or datetime
objects.
I did have trouble finding the ability to do this. I saw some StackOverflow posts about people using numpy
inside Polars to accomplish this and were complaining about its slowness on large datasets. I’m interested in if Polars can do this by itself without any help.
Getting a date
diff is a very common problem to solve.
I did find an interesting method called duration
, docs here. This is something akin to the ability to add or subtract a temporal unit from an existing date
or datetime
object. Something like below you can add a day
. I’m glad I found this, not what I was looking for, but very useful.
Then I had a thought. What if Polars is so wonderful I can simply just subtract one
date
from another. Would this work? Hammer of Thor it does!
So nice!!!
Conclusion.
That was a wonderful breath of fresh air for sure! I was able to pretty much easily figure out and implement all the basic date
and datetime
manipulations in Polars without much trouble at all. Nothing seemed overly strange and wonky to figure out and implement.
It appears Polars has first-class support for date
and datetime
manipulation. Polars truly is a production-ready package, ready to beat Pandas into the dust and far distance past. I’ve already implemented it in Production where I work. What are you waiting for? Get to it.