There are few things in life worse than cracking open some serious PySpark pipeline code and realizing there isn’t a single function written to encapsulate logic … wondering if some change you are about to make will bring down the whole pipeline. When you are new to a codebase you don’t know […]
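As a rough sketch of what I mean by encapsulating logic, here is what a couple of small, named transformation functions might look like. The column names, toy data, and function names below are made up purely for illustration:

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

def add_order_date(df: DataFrame) -> DataFrame:
    """Derive a proper date column from the raw timestamp string."""
    return df.withColumn("order_date", F.to_date("order_ts"))

def filter_valid_amounts(df: DataFrame) -> DataFrame:
    """Drop rows with null or non-positive amounts."""
    return df.where(F.col("amount").isNotNull() & (F.col("amount") > 0))

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy input; in real life this would be a read from storage.
raw = spark.createDataFrame(
    [("2021-05-01 10:15:00", 42.50), ("2021-05-02 11:00:00", -1.0)],
    ["order_ts", "amount"],
)

# Each step is a small, named function that can be unit tested in isolation.
clean = raw.transform(add_order_date).transform(filter_valid_amounts)
clean.show()
```

Every step becomes something you can test and reason about on its own, instead of one giant chain of withColumn calls you’re afraid to touch.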

It seems like the problems and challenges of Data Engineering are being solved at a lightning pace these days. New technologies are coming out all the time that seem to make life a little easier (or harder) while solving age-old problems. I feel like Machine Learning Ops (MLOps) is not one of those things. It’s […]

If you’re anything like me, when someone says Delta Lake you think DataBricks. But the mythical Delta Lake is an open source project, available to anyone running Apache Spark. It also seems too good to be true, ACID transactions at Spark scale? Incredible. This is the future, it has to be. The lines of […]
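For a taste of what the open source side looks like, here is a minimal sketch of writing and mutating a Delta table on plain Apache Spark, assuming the delta-spark package is installed; the path and columns are made up:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta Lake plugs into open source Spark via a session extension and catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(5).withColumnRenamed("id", "order_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Deletes and updates are ACID transactions, even with concurrent readers.
orders = DeltaTable.forPath(spark, "/tmp/delta/orders")
orders.delete("order_id = 3")
```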

Some poor Data Engineer is sweating and typing away in a dark closet … moving data, solving bugs, just trying to get through the day. Why should the ol’ Data Engineer care about the huff-a-luff around the billion-dollar series recently done by DataBricks? I mean, what possible relevance could it have on the day […]

I am going to peer into the crystal ball, the seeing stone, looking into the murky future of Data Engineering to see what mysteries it holds. I’ve seen a story, a tale of two Data Warehouses; I’ve seen Machine Learning, Streams, Distributed Systems, Storage, the eternal SQL. A lot has changed in the world of Data […]

In Part 1 of the big data file formats series we reviewed Parquet vs Avro. It was apparent from the start that the two file formats were built for different things. Avro is clearly a complex row-structured file format used in communication and transactions, where schema is king and nested structures are no problem. Parquet […]
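To make the row-versus-column difference concrete, here’s a small sketch that writes the same made-up record with the fastavro and pyarrow libraries; the schema and data are purely illustrative:

```python
# Avro is schema-first and row oriented; Parquet is columnar.
from fastavro import parse_schema, writer
import pyarrow as pa
import pyarrow.parquet as pq

schema = parse_schema({
    "name": "Order",
    "type": "record",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "customer", "type": {          # nested record, no problem in Avro
            "name": "Customer", "type": "record",
            "fields": [{"name": "name", "type": "string"}],
        }},
    ],
})
records = [{"order_id": 1, "customer": {"name": "Sam"}}]

with open("orders.avro", "wb") as out:
    writer(out, schema, records)                # rows written record by record

# The same data as Parquet: stored column by column for fast analytical scans.
table = pa.table({"order_id": [1], "customer_name": ["Sam"]})
pq.write_table(table, "orders.parquet")
```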

Don’t you like stuff for free? Don’t you like it when stuff is just handed to you? I mean, when was the last time you didn’t want to get a free t-shirt? How about 20 bucks in the mail from your Grandma? That’s kinda what Pipelines are in Spark ML. The Apache Spark ML library […]
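Here’s a minimal sketch of that free stuff, chaining a couple of feature stages and a model the way the Spark ML docs do; the toy data and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop is old", 0.0)],
    ["text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# The Pipeline hands each stage's output to the next; fit() returns one model
# that replays the whole chain at prediction time.
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
predictions = model.transform(train)
```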

With Parquet taking over the big data world, as it should, and CSV files being that third wheel that just will never go away … it’s becoming more and more common to see the repetitive task of converting CSV files into Parquet. There are lots of reasons to do this: compression, fast reads, integrations with tools […]
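The conversion itself is usually only a few lines in PySpark. A minimal sketch, assuming the files have a header row and using hypothetical S3 paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every CSV in the raw prefix and rewrite it as Parquet.
(
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/raw/*.csv")
    .write
    .mode("overwrite")
    .parquet("s3://my-bucket/converted/")
)
```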

I’ve always been surprised, with the rise of data engineering and big data, at how hard it is to find good data engineering content on a somewhat regular basis. Tech moves fast, and I feel like data engineering moves even faster. New tools and systems come out with regular frequency, and it’s hard to keep […]

I’ve always been surprised at how little of the Python code I’ve seen uses the map() and filter() built-ins as standalone functions. I’ve always found them useful and easy to use, but I don’t often come across them in the wild; I’ve even been asked to remove them from my MRs/PRs, for no other […]
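For reference, here is a tiny sketch of map() and filter() used as standalone built-ins, next to the comprehension most reviewers seem to prefer; the price data is made up:

```python
prices = [9.99, 0.0, 25.50, -3.0, 12.00]

# Keep only positive prices, then apply a 10% discount.
discounted = list(map(lambda p: round(p * 0.9, 2), filter(lambda p: p > 0, prices)))

# The equivalent list comprehension.
discounted_comp = [round(p * 0.9, 2) for p in prices if p > 0]

assert discounted == discounted_comp
```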