Apache Parquet vs Apache Avro

There comes a point in the life of every data person when we have to graduate from CSV files. At a certain point the data becomes big enough, or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that have been coming up more with the rise of distributed data processing engines like Spark.

Read more

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole. Most everyone has heard of it, but its usage seems to be sporadic at best. I’m going to talk about what makes Apache Airflow the perfect tool for any Data Engineer, and show you how you can use it to great effect while not committing to it completely.

Read more

StringIO and BytesIO are perfect for making your Python faster.

Ever heard of something called a File Object in Python? Ever heard of BytesIO or StringIO? You’re missing out. They’re easy, fast, and wonderful. In short, they’re the best. For some reason, IO streams are a totally underused feature that rarely comes up in most code. We all know that memory is faster than disk IO, and that is what I use IO streams for.
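To make that concrete, here is a minimal sketch of what "use memory instead of disk" can look like. The function names and the sample rows are made up for illustration and are not taken from the full post. It writes CSV data into a StringIO buffer, then reads raw bytes back through BytesIO as if they were a file.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Write rows to an in-memory text buffer instead of a file on disk."""
    # newline="" leaves the csv module's own line endings untouched.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


def csv_bytes_to_rows(payload):
    """Treat raw bytes as a file object and parse them without touching disk."""
    binary_stream = io.BytesIO(payload)
    # TextIOWrapper decodes the byte stream so csv.reader sees text.
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    return list(csv.reader(text_stream))


# Hypothetical sample data, just to exercise the round trip.
rows = [["id", "name"], ["1", "parquet"], ["2", "avro"]]
csv_text = rows_to_csv_text(rows)
round_tripped = csv_bytes_to_rows(csv_text.encode("utf-8"))
```

Anything that accepts a file object can usually take one of these buffers instead, so a temp file that only exists to hand data from one step to the next can often be skipped.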

Read more

When building data pipelines all day long, every day, every year, ad infinitum, I have, surprisingly, managed to learn some things. You see the same problems with data pipelines many times over. Years ago it was SSIS (I’m sorry if you still have to use it, it just isn’t cool enough anymore); now, if it’s not Streaming it must be wrong (insert eye roll). The technology and what’s hot are always changing, but the 10 Commandments of Data Pipelines never change.

What are the 10 Commandments of Data Pipelines that thou shalt not break? Glad you asked.

Read more

You can’t go anywhere or read anything today in the IT world without running into Machine Learning. It’s the hot new thing. All the cool kids are doing it, so I thought I would give it a try too. A little Python, a little Sklearn, a little SparkML, and lots of reading later… behold my not so wondrous KMeans Unsupervised Machine Learning… thing.

Read more

Last time I shared my experience getting a mini Hadoop cluster set up and running. Lots of configuration and attention to detail. The next step in my grand plan is to figure out how I could use Python to interact (store and retrieve files and metadata) with HDFS. I assumed that since there are beautiful packages to install for all sorts of things, pip installing some HDFS thingy would be easy and away I would sail into the sunset. Yeah… not.

Read more