There are a few things in life I both love and hate. Let’s see … hot weather, cold weather, working for a living, and … LeetCode. I mean, it’s totally fun to push yourself and try to solve hard problems, but the other side of me is like … well, I’ve been writing code for years and 80% of this stuff is nothing like writing code in real life. I think the LeetCode platform itself is an amazing tool, and it has provided both people and companies with an elegant way to showcase and practice skills. But can there be too much of a good thing? Of course.

Read more
Apache Beam for Data Engineers.

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month, whenever the name Apache Beam pops up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across anyone using it in the wild yet. On the surface it appears semi-pointless, since it runs on top of other distributed systems like Spark, but I’m sure there is more to it. Today, I’m going to run through an overview of Apache Beam, then try installing it and running some data through it, kick the tires as it were, and see if my mind changes about the pointless bit.
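To make the tire kicking concrete, here’s roughly the kind of “hello world” pipeline I’ll be aiming for: a minimal sketch using the apache-beam Python SDK on the local DirectRunner. The file names and the naive CSV parsing are placeholders for illustration, not anything from a real project.

```python
# A minimal sketch with the apache-beam Python SDK (pip install apache-beam).
# Runs on the local DirectRunner by default; file names are placeholders.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("orders.csv")         # one element per line
        | "Parse" >> beam.Map(lambda line: line.split(","))    # naive CSV split
        | "Filter" >> beam.Filter(lambda row: row[0] != "id")  # drop the header row
        | "Write" >> beam.io.WriteToText("output/orders")      # writes sharded text files
    )
```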

Read more

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean, the url does say “confessions”, does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a good Data Engineer. Just the life of a Data Engineer in general. The Battle of the Data Engineer is fought in the labyrinth of nested SQL queries. It rages to the depths of distributed computing clusters. It vies for victory on the crags and peaks of DevOps. It claws for precious ground amid the chaos of the perfect OOP and Functional code bases. Phew … and all that just to keep your head above water.

Read more
Apache Parquet vs Apache Avro

There comes a point in the life of every data person when we have to graduate from csv files. At a certain point the data becomes big enough, or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that have been coming up more and more with the rise of distributed data processing engines like Spark.
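Before getting into the weeds, here’s a quick sketch of what writing the same records to each format looks like. I’m assuming pyarrow for Parquet and fastavro for Avro here; those are common choices, not the only ones, and the records are made up.

```python
# Sketch: the same three records written as Parquet (pyarrow) and Avro (fastavro).
# pip install pyarrow fastavro
import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import writer, parse_schema

records = [
    {"id": 1, "name": "beam"},
    {"id": 2, "name": "spark"},
    {"id": 3, "name": "airflow"},
]

# Parquet: columnar, schema inferred from the in-memory table.
pq.write_table(pa.Table.from_pylist(records), "projects.parquet")

# Avro: row-based, schema declared up front.
schema = parse_schema({
    "name": "Project",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})
with open("projects.avro", "wb") as out:
    writer(out, schema, records)
```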

Read more

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole; most everyone has heard of it, but its usage seems to be sporadic at best. I’m going to talk about what makes Apache Airflow the perfect tool for any Data Engineer, and show you how you can use it to great effect while not committing to it completely.
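To give you a taste before the deep dive, a minimal DAG looks something like the sketch below. I’m assuming Airflow 2.x here, and the dag_id, schedule, and bash commands are all placeholders.

```python
# Minimal Airflow 2.x DAG sketch; dag_id, schedule, and commands are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    extract >> load  # run extract before load
```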

Read more
StringIO and BytesIO are perfect for making your Python faster.

Ever heard of something called a File Object in Python? Ever heard of BytesIO or StringIO? You’re missing out. It’s easy, fast, and wonderful; in short, it’s the best. For some reason IO streams are a totally underused feature that rarely comes up in most code. We all know that memory is faster than disk IO, and that’s exactly what I use IO streams for.
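Here’s the flavor of it: building a CSV entirely in memory with StringIO instead of touching disk, then gzipping it in memory with BytesIO. A toy sketch, but this is the pattern.

```python
# Sketch: build a CSV in memory with StringIO instead of writing a temp file.
import csv
import gzip
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name"])
writer.writerows([(1, "parquet"), (2, "avro")])

# The "file" never hit disk; hand its contents to whatever needs them.
csv_text = buffer.getvalue()
print(csv_text)

# BytesIO is the binary twin, handy for things like gzip in memory.
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
    gz.write(csv_text.encode("utf-8"))
payload = compressed.getvalue()  # bytes, ready to ship over the wire
```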

Read more

When building data pipelines all day long, every day, every year, ad infinitum, surprisingly I have managed to learn some things. You see the same problems with data pipelines many times over. Years ago it was SSIS (I’m sorry you still have to use it, it just isn’t cool enough anymore); now if it’s not Streaming it must be wrong (insert eye roll). The technology and what’s hot is always changing, but the 10 Commandments of Data Pipelines never change.

What are the 10 Commandments of Data Pipelines that thou shalt not break? Glad you asked.

Read more

You can’t go anywhere or read anything today in the IT world without running into Machine Learning; it’s the hot new thing. All the cool kids are doing it, so I thought I would give it a try too. A little Python, a little Sklearn, a little SparkML, and lots of reading later … behold my not so wondrous KMeans Unsupervised Machine Learning … thing.
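For the curious, the Sklearn half boils down to something like this toy sketch; the data points and cluster count are made up for illustration.

```python
# Toy sklearn KMeans sketch; the data and n_clusters are made up.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)           # which cluster each point landed in
print(kmeans.cluster_centers_)  # the learned centroids
```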

Read more

Last time I shared my experience getting a mini Hadoop cluster set up and running. Lots of configuration and attention to detail. The next step in my grand plan is to figure out how I could use Python to interact (store and retrieve files and metadata) with HDFS. I assumed that since there are beautiful packages to install for all sorts of things, pip installing some HDFS thingy would be easy and away I would sail into the sunset. Yeah … not.
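For context, this is the kind of thing I was hoping would just work: a sketch using the hdfs package, which talks to HDFS over WebHDFS. The namenode URL, user, and paths below are all placeholders, not my actual cluster.

```python
# Sketch with the `hdfs` package (pip install hdfs), which talks WebHDFS.
# The namenode URL, user, and paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Store a file.
client.write("/data/hello.txt", data=b"hello hdfs", overwrite=True)

# Retrieve it.
with client.read("/data/hello.txt") as reader:
    print(reader.read())

# Metadata.
print(client.status("/data/hello.txt"))  # size, owner, modification time, ...
print(client.list("/data"))              # directory listing
```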

Read more