Someone recently brought up the new kid on the block, the httpx python package for http work of course. I mean the pypi packagerequests has been the de-facto standard forever. Can it really be overthrown? Is this a classic case of “oh how the mighty have fallen”? I want to explore what the new httpx […]

First, let’s set the record straight. GCP is better than AWS. This will be clear to anyone who has used both services for a reasonable amount of time. GCP was built with the developer in mind, the services and tools offered work better, are cleaner, and way simplier. But, there is one thing that is […]

There are some things I will never understand. Async in Python is one of them. Yes, sometimes I use it, but mostly because I’m bored and we all should have some kind of penance. Async in mine. It’s slow, confusing, other people get mad at you when they have to debug your Async code. I’ve […]

In Part 1 of my laborious journey from Python to Scala, I did some work with file operations, CSV files, and messing with the data. It took me a little longer then I expected to wrap my head around the Scala functional/object/immutable approach to software design. But, in the end if felt satisfying and I’m […]

One of the greatest tools in Python is Pandas. It can read about any file format, gives you a nice data frame to play with, and provides many wonderful SQL like features for playing with data. The only problem is that Pandas is a terrible memory hog. Especially when it comes to concatenating groups of […]

UPDATE: If you want to know how my Scala SHOULD have been written. Check out this link! I feel like a frontiersmen heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. […]

I’m probably going to have to eat this blog post 2 years from now…. oh well. I still believe that Async has been mostly a failure since introduced in Python 3.4. Maybe I should be more specific, there seems to be a failure to adopt Async in the Python community and major packages at large. […]

There comes a point in the life of every data person that we have to graduate from csv files. At a certain point the data becomes big enough or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that been coming up more with […]

ml pipelines Building Machine Learning (ML) pipelines with big data is hard enough, and it doesn’t take much of a curve ball to make it a nightmare. Most of what you will read online are tutorials on how to take a few CSV files and run them through some sklearn package. If you are lucky, […]

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole, most everyone has heard […]