Golang – Useful for everyday Data Engineering?
I periodically try to pick up a new programming language on my journey through Data Engineering life. There are many reasons to do that: personal growth, boredom, seeing what others like, and helping me think differently about my code. Golang has been on my list for at least a year. I don’t hear much about it in the Data Engineering world, at least in the places I haunt like r/dataengineering and LinkedIn.
I know tools like Kubernetes and Docker are written with Go, so it must be powerful and wonderful. But what about Data Engineering work … and everyday Data Engineering work at that? Is Go useful as an everyday tool for simple Data Engineering tasks? Read on, my friend.
Golang for everyday Data Engineering.
Before all the crazy people from the internet start bothering me about my Go code, let me be clear: this is me learning … and I’m learning from a certain viewpoint. I want to know if Golang is reasonably usable for day-to-day Data Engineering tasks … for me. You can decide for yourself what you think about Golang. (All code below is on GitHub.)
For me, and what I do day to day, learning a new language comes down to a few main points.
- What’s the learning curve like?
- How hard is it to do simple tasks?
- How does the language make me think about solving problems?
- How does the language seem to fit data pipelines?
Sure, that’s probably not why a lot of folks use Go, but for me, that’s what I would use it for. On the flip side, I always enjoy learning a new language; it helps me be more expressive and a better problem solver in my daily Data Engineering life. Each language has its own nuances and favors certain ideas and approaches. Combined, this sort of learning is good for the soul and mind.
First Example Golang Project – Reading CSV files.
One of the first tasks I always try to complete when working with a new language is processing some CSV files. It’s usually pointless CSV processing, but it’s more for the learning and the experience: to get a feel for the language, in this case Go, and start to understand how easy or difficult certain things are in each language.
I find it a good indicator if it’s “easy” to process a CSV file. We are simply going to read some CSV files from the free Divvy bike trip data set. We will read each file and count the number of member records each file contains.
All that being said, let’s take a look at my first attempt at writing Go and then talk about it.
package main

import (
	"encoding/csv"
	"fmt"
	"io/fs"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// read_dir lists the entries in the local "data" directory.
func read_dir() []fs.FileInfo {
	files, err := ioutil.ReadDir("data")
	if err != nil {
		log.Fatal(err)
	}
	return files
}

// get_paths builds an absolute path for every entry whose name contains ".csv".
func get_paths(files []fs.FileInfo) []string {
	var paths []string
	for _, f := range files {
		// Dir of a bare file name is ".", so this resolves to the working directory.
		thepath, err := filepath.Abs(filepath.Dir(f.Name()))
		if err != nil {
			log.Fatal(err)
		}
		if strings.Contains(f.Name(), ".csv") {
			paths = append(paths, thepath+"/data/"+f.Name())
		}
	}
	return paths
}

// read_csv loads every row of the file into memory.
func read_csv(filePath string) [][]string {
	f, err := os.Open(filePath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	csvReader := csv.NewReader(f)
	records, err := csvReader.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	return records
}

// work_records counts the rows whose 13th column is exactly "member".
func work_records(records [][]string) {
	sum := 0
	for _, r := range records {
		if r[12] == "member" {
			sum += 1
		}
	}
	result := fmt.Sprintf("the file had %v member rides in it", sum)
	fmt.Println(result)
}

func main() {
	start := time.Now()
	files := read_dir()
	paths := get_paths(files)
	fmt.Println(paths)
	for _, p := range paths {
		rcrds := read_csv(p)
		work_records(rcrds)
	}
	duration := time.Since(start)
	fmt.Println(duration)
}
>> go run csv.go
[/Users/danielbeach/code/csv_go/data/202004-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202005-divvy-tripdata.csv /U
sers/danielbeach/code/csv_go/data/202006-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202007-divvy-tripdata.csv /User
s/danielbeach/code/csv_go/data/202008-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202009-divvy-tripdata.csv /Users/d
anielbeach/code/csv_go/data/202010-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202011-divvy-tripdata.csv /Users/dani
elbeach/code/csv_go/data/202012-divvy-tripdata.csv /Users/danielbeach/code/csv_go/data/202101-divvy-tripdata.csv /Users/danielb
each/code/csv_go/data/202102-divvy-tripdata.csv]
the file had 61148 member rides in it
the file had 113365 member rides in it
the file had 188287 member rides in it
the file had 282184 member rides in it
the file had 332700 member rides in it
the file had 302266 member rides in it
the file had 243641 member rides in it
the file had 171617 member rides in it
the file had 101493 member rides in it
the file had 78717 member rides in it
the file had 39491 member rides in it
>> 4.583988333s
Thoughts on my first Golang script.
I honestly loved my first experience writing Go for my silly little CSV pipeline. I learned a lot about Go and got a decent feel for how things fit together. When I think back to learning Scala for the first time, for example, Go seemed to be a little bit more approachable for me.
Here are some of the first things I noticed …
- common imports like encoding/csv and strings, even path/filepath, make simple tasks easy.
- easy to define functions and types.
- catching errors is easy.
- syntax is easy and straightforward.
Even thinking back to my first time writing Scala, this Go script was just more straightforward to write when it came to processing a CSV file. It might not seem like much, but having a package like encoding/csv where I can simply and easily load a CSV file …
csvReader := csv.NewReader(f)
records, err := csvReader.ReadAll()
return records
It’s refreshing, and to me it is a good sign that Go can solve simple tasks in a simple way, making it a decent choice for everyday Data Engineering. Another sign of Go’s usefulness in common DE tasks was the nice strings package …
if strings.Contains(f.Name(), ".csv")
It’s the simple things in life that make things easier. I’m a fan of Golang. But, out of curiosity, what would this code look like in Python? I’m mostly curious about the performance.
import csv
from glob import glob
from datetime import datetime


def get_files(directory: str = 'data') -> list:
    # Collect every .csv file in the data directory.
    files = glob(f'{directory}/*.csv')
    return files


def read_csv(file: str) -> list:
    with open(file, "r") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header
        rows = [row for row in reader]
    return rows


def work_records(records: list) -> None:
    total = 0
    for record in records:
        if 'member' in record[12]:
            total += 1
    print("the file had {v} member rides in it".format(v=str(total)))


def main():
    t1 = datetime.now()
    files = get_files()
    for file in files:
        records = read_csv(file)
        work_records(records)
    t2 = datetime.now()
    print(t2 - t1)  # elapsed wall-clock time for the whole run


main()
>> python3 main.py
the file had 171617 member rides in it
the file had 101493 member rides in it
the file had 61148 member rides in it
the file had 302266 member rides in it
the file had 78717 member rides in it
the file had 39491 member rides in it
the file had 282184 member rides in it
the file had 188287 member rides in it
the file had 113365 member rides in it
the file had 332700 member rides in it
the file had 243641 member rides in it
13:57:54.929582
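Sure, that Python code is a little cleaner. One caveat on the timing, though: the 13:57:54.929582 at the end of the Python run above is a time of day (that run printed the end timestamp rather than the elapsed time t2 - t1), so it can’t be set against the 4.583 seconds that Go reported. I figured Go would be faster, and I’d still expect it to come out ahead, but this run doesn’t show by how much.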
What gets me excited about Go is not only that it got through eleven files in under five seconds, but also that the Go script itself was easy to write, and that for a beginner!
Musings on Golang as a day-to-day Data Engineering tool.
I’m excited to continue to learn Golang; it seems like a fun tool to use and write. I’m looking forward to testing some more code with Go, like maybe doing some more http requests and file manipulations. I’m curious about its integrations with cloud tools like AWS, its concurrency options, and just generally how well it will continue to perform and how easy it will be to use.
At the end of the day, learning Go is going to be a good exercise in keeping myself moving forward, thinking in new ways, and solving problems with a new tool that will keep me agile and open-minded. I like the syntax and data structures so far; they’re easy to understand and use, and I feel the learning curve is less than what I experienced with Scala.
Golang gets a big thumbs up from me. You will see more Go in my blogs in the future!