Yeah, data engineering seems to be a hot topic today, as much as data science was 3 years ago. What does a data engineer do, and what skills do you need? Peruse the job postings and it will quickly become overwhelming. Spark, Hadoop, SQL, Python, Scala, ETL, Data Warehousing, various Data Sciencey Things, Streaming, Analytics, Business Intelligence, Machine Learning, blah blah blah. What are the top skills needed to be a successful Data Engineer? What does the average day look like for a Data Engineer? Here are my two cents; they’re probably worth what you paid to read this.

Read more

So after watching way too many end-of-the-world movies on Netflix, I decided the best way to prepare for the Zombie Apocalypse would be to give myself a way to know when the dead are about to crash through my living room window (while I’m eating popcorn watching zombies on Netflix, of course). This is one reason I love Python: I knew I would barely have to write any code to do this. I figured if I could scrape the popular news sites and do some simple sentiment analysis, get the government threat levels, some weather alerts, etc., and jam all this data together, I would get a perfect Doomsday clock to tell me how close we are to the end of the world on any given day. So let’s begin. All the code is on GitHub. Here is a visual of what I wanted.


Read more

I’ve been wanting to follow up on a post I did recently that was a quick intro to Apache Parquet, specifically when, where, and why to use it, maybe test some of its features, and look at what makes it a great alternative to flat files and CSV files.

Read more

Columnstore indexes promise to be the savior of every data warehouse. So, what are they, when should you use them, and when should you stay away? Columnstore indexes are just what they sound like: data physically stored in a columnar way. This is what makes them so fast when it comes to aggregating large amounts of data. The data is compressed and similar values are stored together, so the database engine can very quickly grab all the values it needs to SUM, for example. This all leads to faster query results.
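To make the columnar idea a bit more concrete, here is a toy sketch in plain Python. It is not how any real database engine implements columnstore indexes. The table, the column names, and the helper functions are all made up for illustration, and run-length encoding is just one simple stand-in for the kind of compression that storing similar values together makes possible.

```python
# Hypothetical sales rows; the data and column names are invented for this sketch.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 45.5},
    {"order_id": 4, "region": "west", "amount": 10.0},
]


def sum_amount_rowstore(rows):
    """Row layout: every full record is visited just to read one field."""
    return sum(row["amount"] for row in rows)


def to_columnstore(rows):
    """Pivot the records so each column lives in its own contiguous list."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [(value, count) for value, count in encoded]


columns = to_columnstore(rows)

# Column layout: a SUM only has to read the one list it cares about.
total = sum(columns["amount"])

# Similar values sitting next to each other compress well.
compressed_region = run_length_encode(columns["region"])
```

In the column layout the aggregate never looks at `order_id` or `region` at all, and the `region` column shrinks to one pair per run of repeated values. That is the basic intuition, scaled down to four rows.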

Read more

Last time I shared my experience getting a mini Hadoop cluster set up and running. Lots of configuration and attention to detail. The next step in my grand plan is to figure out how I could use Python to interact (store and retrieve files and metadata) with HDFS. I assumed that since there are beautiful packages to install for all sorts of things, pip installing some HDFS thingy would be easy and away I would sail into the sunset. Yeah…not.

Read more


I’ve been wanting to get more hands-on experience with Apache Hadoop for years. It’s one thing to read about something and say yeah… I get it, but trying to implement it yourself from scratch just requires a whole different level of understanding. There seems to be something about trying to solve a problem that helps a person understand the technology a little better.

Read more

There sure has been a lot of kerfuffle around Spark lately. Spark this, Spark that, Spark is the best thing ever, and so on and so forth. I recently had some small exposure to PySpark when working on a Glue project. At the time a lot of the functions reminded me of Pandas, and I’ve been trying to find time to explore Spark a little more.

Read more

As someone who is self-taught when it comes to coding, there are always topics that feel out of reach, or just plain magic. Also, as I’ve spent my career specializing in all things data, what I’ve needed to learn has always been very specific. Most of all, eventually the same old things become boring and it’s time to try something new.

Enter concurrency and parallelism.

Read more

Update: Check out my new Parquet post.
Recently, while delving and burying myself alive in AWS Glue and PySpark, I ran across a new-to-me file format: Apache Parquet.

It promised to be the unicorn of data formats. I’ve not been disappointed yet.

Read more

I work with Python and data a lot, specifically different RDBMSs with structured data. Anyone who does this type of work will probably have run across pyodbc, a Python package that allows ODBC access into different database platforms.

Read more