Home - Confessions of a Data Guy

Learning to do HTTP in Scala. – Part Trois.

Man. Every time I open IntelliJ to write/learn some more Scala I have to take a deep breath. Yes, it’s been fun and good for my brain, which feels like it’s in a Doctor Strange movie, but it’s also challenging and frustrating at times. One of the things I find myself doing a lot as […]

September 9, 2020

Data, Data Engineering, Machine Learning, Python

Deploying Apache Airflow inside Kubernetes.

Has anyone else noticed how popular Apache Airflow and Kubernetes have become lately? There is no better tool than Airflow for Data Engineers to build approachable and maintainable data pipelines. I mean Python, a nice UI, dependency graphs/DAGs. What more could you want? There is also no better tool than Kubernetes for building scalable, flexible […]

August 22, 2020

Data, Data Engineering, Data Warehousing, SQL

SQL Database (RDBMS) Design for Data Engineers

Database design… hmmm. There is probably nothing more all over the board in tech. Data warehousing, analytics, OLTP… everyone with their own “defend this hill to the death” ideas. Kimball vs Inmon. Hmmm.. what to do, what to do? After defending my own hills to the death over the years and arguing over whiteboards I’ve […]

July 20, 2020

Data Engineering, Python, Ramblings

Httpx vs Requests in Python. Performance and other Musings.

Someone recently brought up the new kid on the block, the httpx Python package for HTTP work, of course. I mean, the PyPI package requests has been the de facto standard forever. Can it really be overthrown? Is this a classic case of “oh how the mighty have fallen”? I want to explore what the new httpx […]

July 12, 2020

Data, Python, Ramblings

Hey Google Cloud, ever heard of Boto3? Come on.

First, let’s set the record straight. GCP is better than AWS. This will be clear to anyone who has used both services for a reasonable amount of time. GCP was built with the developer in mind; the services and tools offered work better, are cleaner, and way simpler. But, there is one thing that is […]

July 7, 2020

Data, Data Engineering, Python

How Chuck Norris Proved Async in Python isn’t Worthy.

There are some things I will never understand. Async in Python is one of them. Yes, sometimes I use it, but mostly because I’m bored and we all should have some kind of penance. Async is mine. It’s slow, it’s confusing, and other people get mad at you when they have to debug your Async code. I’ve […]

July 4, 2020

Data, Data Engineering, Scala

My Journey from Python to Scala – Part Deux

In Part 1 of my laborious journey from Python to Scala, I did some work with file operations, CSV files, and messing with the data. It took me a little longer than I expected to wrap my head around the Scala functional/object/immutable approach to software design. But, in the end it felt satisfying and I’m […]

June 22, 2020

Data, Data Engineering, Machine Learning, Python

Solving the Memory Hungry Pandas Concat Problem.

One of the greatest tools in Python is Pandas. It can read just about any file format, gives you a nice data frame to play with, and provides many wonderful SQL-like features for playing with data. The only problem is that Pandas is a terrible memory hog. Especially when it comes to concatenating groups of […]

June 8, 2020

Data, Data Engineering, Python, Scala

My Journey from Python to Scala – Part 1

UPDATE: If you want to know how my Scala SHOULD have been written, check out this link! I feel like a frontiersman heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. […]

May 6, 2020

Python

The Utter Failure of Async in Python

I’m probably going to have to eat this blog post 2 years from now… oh well. I still believe that Async has been mostly a failure since it was introduced in Python 3.4. Maybe I should be more specific: there seems to be a failure to adopt Async in the Python community and major packages at large. […]

April 16, 2020
