There sure has been a lot of kerfuffle around Spark lately. Spark this Spark that, Spark is the best thing ever, and so on and so forth. I recently had some small exposure to PySpark when working on a Glue project, at the time a lot of the functions reminded me of Pandas and I’ve been trying to find time to explore Spark a little more.

Read more

As someone who is self-taught when it comes to coding there are always topics that feel out of reach, or just plain magic. Also, as I’ve spent my career specializing in all things data, what I’ve needed to learn has always been very specific. Most of all, eventually the same old things become boring, time to try something new.

Enter concurrency and parallelism.

Read more

Update: Check out my new Parquet post.
Recently while delving and burying myself alive in AWS Glue and PySpark, I ran across a new to me file format. Apache Parquet.

It promised to be the unicorn of data formats. I’ve not been disappointed yet.

Read more

I work with Python and data a lot, specifically different RDBMS’s with structured data. Anyone who does this type of work will probably have run across pyodbc, a Python package that allows ODBC access into different
database platforms.

Read more

I recently did a little project to find out what makes a company tick, using Python and the Twitter API. It has to be done quickly, in like a day, and didn’t need to be overly complicated.

Read more

One of the biggest hurdles I’ve found when teaching myself any sort of SQL/Python/Data Wrangling skills is the problem of finding usable, real life data to work with. Data that I can actually attempt to answer questions with.

Read more

Yeah, Web Scraping is super easy in Python, just pip install BeautifulSoup and away you go. Not.

Read more

Some of the most unused yet powerful functions in T-SQL are Window functions. These functions are powerful because they allow calculations on a Window of the data you specify, even while the calculation scrolls through your data.

Read more

Hmmm…. What to use… What to use? When I want to explore data quickly and with the least amount of pain, the first problem I face is where do I start. There are a million approaches and I’m usually thinking long-term, ease of maintenance, surrounding platform, etc etc.

Read more

Seriously…..It’s called a non-lookup dude. Probably one of the most annoying situations I’ve come across when working on Enterprise Data Warehouse {EDW} teams/projects is the non-lookup problem.

Read more