Musings on Python’s map() and filter()
I’ve always been surprised at the distinct lack of most Python code I’ve seen using the map()
and filter()
methods as standalone functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild, I’ve even been asked to remove them from my MR/PR’s, for no other reason then that they are supposedly ambiguous to some people? That’s got me thinking a lot about map()
and filter()
as related to readability, functional programming, side effects and other never ending debates where no one can even agree on the “correct” definition. Seriously. But, I will leave that rant for another time.
Reviewing Python’s map() and filter()
I have no desire to dig into the internals of how map and filter work. But, lets just review them quickly to set a baseline.
map()
If you’re going to run into one of these methods, you are most likely to come across `map()` before anything else. Map is very straight forward… it takes two arguments…
- function
- iterable (for example a list)
Also, map()
will return a iterable as well.
def who_is(name: str) -> str:
return f"{name} is a hobbit."
workload = ['Frodo', 'Billbo', 'Samwise']
hobbits = map(who_is, workload)
for hobbit in hobbits:
print(hobbit)
...
Frodo is a hobbit.
Billbo is a hobbit.
Samwise is a hobbit.
>>>
Nothing to earth shattering about that.
filter()
The filter()
method is exactly the same as map()
except it’s doing the opposite in a sense. It takes two inputs as well.
- function to test truthiness
- iterable.
It will return a iterable as well.
def who_is(name: str) -> str:
if "F" in name:
return f"{name} is a hobbit."
workload = ['Frodo', 'Billbo', 'Samwise']
hobbits = map(who_is, workload)
for hobbit in hobbits:
print(hobbit)
...
Frodo is a hobbit.
>>>
Musings on readability, testability, side affects and more.
I’m not totally sure if I buy this one, but I’ve seen it come up when objections are raised about using map()
and filter()
, and it usually takes the shape of some ambiguous statement about how it isn’t obvious what is happening in the code … especially if you throw a lambda
in the middle… which I’ve been known to do. Again, I sort of get it and I don’t.
- Yes, most people use
for
loops to do everything… even simple and small actions. - I don’t think
map()
orfilter()
are not “readable”, you just don’t use them so it makes you take a second look.
Let’s say you have a list or stream of customer records that require some slight ETL changes on ingestion into some system. You need to create a full name based on first and last names. Now most people are going to do this. Specifically the for record in records_stream:
def create_full_name(record: dict) -> dict:
first_name = record["first_name"]
last_name = record["last_name"]
record["full_name"] = f"{first_name} {last_name}"
return record
records_stream = [
{"first_name": "Billbo", "last_name": "Baggins"},
{"first_name": "Samwise", "last_name": "Gamgee"}
]
records = []
for record in records_stream:
record = create_full_name(record)
records.append(record)
print(records)
...
[{'first_name': 'Billbo', 'last_name': 'Baggins', 'full_name': 'Billbo Baggins'}, {'first_name': 'Samwise', 'last_name': 'Gamgee', 'full_name': 'Samwise Gamgee'}]
>>>
Now for sake of brevity and simplicity I’m not sure why you wouldn’t just write….
def create_full_name(record: dict) -> dict:
first_name = record["first_name"]
last_name = record["last_name"]
record["full_name"] = f"{first_name} {last_name}"
return record
records_stream = [
{"first_name": "Billbo", "last_name": "Baggins"},
{"first_name": "Samwise", "last_name": "Gamgee"}
]
records = map(create_full_name, records_stream)
print([record for record in records])
...
[{'first_name': 'Billbo', 'last_name': 'Baggins', 'full_name': 'Billbo Baggins'}, {'first_name': 'Samwise', 'last_name': 'Gamgee', 'full_name': 'Samwise Gamgee'}]
>>>
Readability is important, but I don’t think map
or filter
would violate that. I usually see the readability pushback with combining a map
and filter
with a lambda
. Something like this. Although my lambda
is very wordy, imbedding such a thing inside a map
or filter
like below I suppose can be less obvious.
records_stream = [
{"first_name": "Billbo", "last_name": "Baggins", "treasure" : 50},
{"first_name": "Samwise", "last_name": "Gamgee", , "treasure" : 20}
]
records = map(
lambda record: {
"first_name": record["first_name"],
"last_name": record["last_name"],
"full_name": record["first_name"] + " " + record["last_name"],
},
records_stream,
)
print([record for record in records])
...
[{'first_name': 'Billbo', 'last_name': 'Baggins', 'full_name': 'Billbo Baggins'}, {'first_name': 'Samwise', 'last_name': 'Gamgee', 'full_name': 'Samwise Gamgee'}]
>>>
When it comes to testability, I feel like if you write code that code can be tested. As long as you can wrap something in a def
you can probably easily pytest
it.
def add_full_names(records_stream: list) -> iter:
records = map(
lambda record: {
"first_name": record["first_name"],
"last_name": record["last_name"],
"full_name": record["first_name"] + " " + record["last_name"],
},
records_stream,
)
return records
def test_add_full_names():
test_stream = [{"first_name": "bing", "last_name": "bong"}]
outputs = add_full_names(test_stream)
print([output for output in outputs])
assert [output for output in outputs] == [{"first_name": " bing", "last_name": " bong", "full_name": "bing bong"}]
When it comes to functional testing, side effects and all the other things those Scala
folks talk about I’m not sure if the above function
violates any of those rules or not. I generally try to only mutate data once in a single function, which is really what I’m doing above, just because the map
and lambda
exist in the same definition I’m not sure if that matters. I always at least loosely try to follow the no side effects idea. There is nothing worse then a giant function that should be 5 functions, it’s always messy, hard to read, and hard to know what’s happening or supposed to happen.
It seems that by default map
and filter
are always mutating the data input, so in theory a function should only include the map
and filter
and nothing else. Which maps and filters are so concise it’s kinda hard to not keep going.
Performance of map
and filter
vs for
loops.
I’ve often wondered about the performance difference, if any between using a map
and a for
loop. I’m sure I could just google it but the proof is in the pudding. I’m going to use the ole’ Divvy Bike trips open source dataset to get a decent sized csv file to iterate rows on. There are 550,000
rows in my dataset so that should be good enough.
Let’s say I wanted to return all the values where someone rented a bike for more then 1 day.
import csv
from datetime import datetime
def date_diff(row: list):
start = datetime.strptime(row[2], '%Y-%m-%d %H:%M:%S')
end = datetime.strptime(row[3], '%Y-%m-%d %H:%M:%S')
diff = start - end
if diff.days >= 1:
return diff.days
else:
return 0
def read_csv_file(csv_path: str) -> object:
f = open(csv_path, "r")
csv_reader = csv.reader(f)
next(csv_reader)
return csv_reader
if __name__ == "__main__":
csv_reader = read_csv_file("202007-divvy-tripdata.csv")
t1 = datetime.now()
days = []
for row in csv_reader:
day = date_diff(row)
days.append(day)
sum([day for day in days])
t2 = datetime.now()
print(t2-t1)
Above is the simple for loop
and below is what I changed to add the map
.
if __name__ == "__main__":
csv_reader = read_csv_file("202007-divvy-tripdata.csv")
t1 = datetime.now()
days = map(date_diff, [row for row in csv_reader])
sum([day for day in days])
t2 = datetime.now()
print(t2 - t1)
Well, didn’t see that coming. Maybe you did? Apparently the plain old for
loop is faster then the fancy map
. I suppose it isn’t that big of a deal, but if your writing custom distributed Python on Kubernetes like me…. then it does kinda of matter which is faster. I suppose little choices like this add up over time, and on top of bit data would end up costing you more money. Below is the graph, I ran each set of code 5 times.
Musings in conclusion.
Maybe it’s just because I’ve been writing Scala lately and appreciate concise statements and less lines of code. Part of me is ok with writing map
and filter
even with a nasty lambda
in the middle…. just because I like the way it looks. But, knowing at scale that the basic for
loop is faster… well that does give me pause.