Gandalf’s Beard! DataFrames in Golang.
I’m not sure if DataFrames in Golang were created by Gandalf or by Saruman; it is still unclear to me. I mean, if I want a DataFrame that bad … why not just use something normal like Python, Spark, or pretty much anything else but Golang? But if Rust gets DataFusion, then Golang can’t be left out to dry, can it!? I guess if you’re hardcore Golang and nothing else will do, and you’re playing around with CSV files, then maybe? Seems like kind of a stretch. But I have a hard time saying no to Golang, it’s just so much fun. Kinda like when Gandalf told them little hobbits and dwarves to not stray from the path going through Mirkwood, and those little buggers did it anyways. Code available on GitHub.
DataFrames in Golang, just because we can.
I still think it’s a stretch, but I guess we could probably find some use cases for needing DataFrames in Golang in real life, maybe. One thing is for sure, it should be fast like everything else Golang does. Let’s contrive an example so we can just play around with Golang DataFrames and see what they have to offer, and how horrible or pleasant they are to work with.
Problem Statement
Let’s say one day your Over-Lord shows up in your Slack. This Over-Lord brandishes a freshly polished sword and exclaims that there is a new quest that needs to be completed, one that is perfect for such a peasant and lowly creature as yourself. Your Over-Lord declares you must be subject to a recent edict from High-Powers that only Golang can be used for such quests, and that using Python or other dynamically typed languages will be punishable by a gruesome and horrible death.
You, being the paltry knave you are, immediately agree to this most honorable quest, and execute it with much vigor.
The quest is to use Golang to process incoming CSV files that contain detailed bike trip data, filter them to member-only records, then aggregate the data by counting the number of bike rides per station, and find the most popular stations. So …
- read incoming CSV files as a DataFrame.
- filter the CSV to member-only records.
- count the number of bike rides per start_station_name.
- order the results by the count descending.
Quest Begins.
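package main

import (
	"fmt"
	"log"
	"os"

	"github.com/go-gota/gota/dataframe"
)

func main() {
	// Open the raw trip data.
	csvfile, err := os.Open("data/202206-divvy-tripdata.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer csvfile.Close()

	// ReadCSV reports parse problems through df.Err, not a second return value.
	df := dataframe.ReadCSV(csvfile)
	if df.Err != nil {
		log.Fatal(df.Err)
	}
	fmt.Println("df: ", df)
}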
Well, it works, I had no doubt.
(base) danielbeach@Daniels-MacBook-Pro goframes % go run frames.go
df: [769204x13] DataFrame
ride_id rideable_type started_at ended_at ...
0: 600CFD130D0FD2A4 electric_bike 2022-06-30 17:27:53 2022-06-30 17:35:15 ...
1: F5E6B5C1682C6464 electric_bike 2022-06-30 18:39:52 2022-06-30 18:47:28 ...
2: B6EB6D27BAD771D2 electric_bike 2022-06-30 11:49:25 2022-06-30 12:02:54 ...
3: C9C320375DE1D5C6 electric_bike 2022-06-30 11:15:25 2022-06-30 11:19:43 ...
4: 56C055851023BE98 electric_bike 2022-06-29 23:36:50 2022-06-29 23:45:17 ...
5: B664188E8163D045 electric_bike 2022-06-30 16:42:10 2022-06-30 16:58:22 ...
6: 338C05A3E90D619B electric_bike 2022-06-30 18:39:07 2022-06-30 19:05:02 ...
7: C037F5F4107788DE electric_bike 2022-06-30 12:46:14 2022-06-30 14:12:48 ...
8: C19B08D794D1C89E electric_bike 2022-06-30 11:09:38 2022-06-30 11:10:25 ...
9: 6E9E3A041C14E960 electric_bike 2022-06-30 11:05:46 2022-06-30 11:09:11 ...
... ... ... ... ...
<string> <string> <string> <string> ...
Not Showing: start_station_name <string>, start_station_id <string>,
end_station_name <string>, end_station_id <string>, start_lat <float>, start_lng <float>,
end_lat <float>, end_lng <float>, member_casual <string>
Next, we need to apply the filter to find member-only records.
// series.Eq needs one more import: "github.com/go-gota/gota/series"
members_df := df.Filter(dataframe.F{Colname: "member_casual", Comparator: series.Eq, Comparando: "member"})
It’s honestly a little annoying and verbose to filter a DataFrame in Golang, but I guess it is what it is. I mean generally, it makes sense: first pass in the column, then the comparison operator, and then the value to filter on. It just feels a little awkward.
Next, we need to aggregate by start_station_name and then count the number of ride_ids happening per start_station_name.
station_groups := members_df.GroupBy("start_station_name")
station_rides := station_groups.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"ride_id"})
sorted := station_rides.Arrange(dataframe.RevSort("ride_id_COUNT"))
fmt.Println("df: ", sorted)
And the result.
(base) danielbeach@Daniels-MacBook-Pro goframes % go run frames.go
df: [1084x2] DataFrame
ride_id_COUNT start_station_name
0: 46090.000000
1: 3143.000000 DuSable Lake Shore Dr & North Blvd
2: 2964.000000 Kingsbury St & Kinzie St
3: 2811.000000 Streeter Dr & Grand Ave
4: 2737.000000 Wells St & Concord Ln
5: 2617.000000 Clark St & Elm St
6: 2580.000000 Theater on the Lake
7: 2575.000000 Wells St & Elm St
8: 2362.000000 Michigan Ave & Oak St
9: 2300.000000 Broadway & Barry Ave
Ok, the grouping is pretty normal, station_groups := members_df.GroupBy("start_station_name"), but the aggregation is a little wonky: station_rides := station_groups.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"ride_id"}). Again, a little verbose. I mean even DataFusion with Rust looks a little better, although not much: let df = df.aggregate(vec![col("member_casual")], vec![count(col("ride_id"))])?;
It took about 6.9 seconds (main took 6.897392875s, per the Golang timer). I’m genuinely curious how this stacks up to just plain Pandas with Python.
import pandas as pd
from datetime import datetime


def main():
    t1 = datetime.now()
    df = pd.read_csv("data/202206-divvy-tripdata.csv")
    df = df[df.member_casual == 'member']
    df2 = df.groupby(['start_station_name'])['ride_id'].count().reset_index(name='count') \
        .sort_values(['count'], ascending=False)
    print(df2)
    t2 = datetime.now()
    print("it took {x} to run".format(x=t2-t1))


if __name__ == '__main__':
    main()
And the performance … 0:00:02.991810, so about 3 seconds.
(base) danielbeach@Daniels-MacBook-Pro goframes % python3 test_with_python.py
start_station_name count
291 DuSable Lake Shore Dr & North Blvd 3143
489 Kingsbury St & Kinzie St 2964
959 Streeter Dr & Grand Ave 2811
1012 Wells St & Concord Ln 2737
189 Clark St & Elm St 2617
... ... ...
it took 0:00:02.991810 to run
Interesting. Of course the Python Pandas code is easier to read and write, but it’s also way faster than the Golang DataFrame (about 3 seconds versus about 6.9), although that probably says more about the gota package than about Golang. Just goes to show that when someone says Golang or language x will always be faster than poor old Python, they are forgetting about implementation.
Musings on DataFrames with Golang
Although I was surprised that Python’s C-based Pandas was faster than Golang, or at least the gota implementation, I guess it’s not that surprising after all. A lot of work has gone into Pandas, as compared to the newish gota with Golang. It is nice to have the option to use DataFrames in a fairly easy manner with Golang, if you’re a poor old peasant who’s at the beck and call of an Over-Lord who demands you use “better” and “fast” languages like Golang.
It looks like there are plenty more features and functions in gota DataFrames with Golang, although based on the implementation verbosity and the performance, I doubt I will ever use it again. I think I would prefer pretty much anything else first: Pandas, Spark, or even DataFusion with Rust. Code available on GitHub.