Home - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing, Python

Intro to Apache Cassandra for Data Engineers

Hmm… yet another distributed database …. will it ever end? Probably not. It’s hard to keep up with them all, even the old ones. That brings me to Apache Cassandra. Of all the popular big data distributed databases Cassandra seems to be kind of that student who always sits in the back row and never […]

December 10, 2020

Data, Data Engineering, Data Warehousing, SQL

Database/SQL Fundamentals for Data Engineers

I’ve met my fair share of snooty people who pooh-pooh SQL and databases as second-class hand-me-downs. I still remember talking to an academic computer science grad who was explaining to me how he refused to teach database classes; he was just too good for that. Whatever. Apparently refusing to accept how 90% of […]

December 4, 2020

Data, Data Engineering, Ramblings, Scala

Scala with Text Files and ElasticSearch

I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in […]

November 20, 2020

Big Data, Data, Data Engineering, Python, Uncategorized

Intro to Apache Beam for Data Engineers

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month when I actually see the name Apache Beam pop up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across […]

November 17, 2020

Data, Data Engineering, Python, SQL

Pandas DataFrame.to_sql() …. { how you should configure it to not be that guy. }

I never understand it when someone comes up with a great tool, then defaults it to work poorly… leaving the rest up to the imagination. The Pandas DataFrame has a great and underutilized tool… to_sql(). Lesson learned, always read the fine print I guess. I’m usually guilty of this myself… wondering why something is slow […]

November 11, 2020

Data Engineering, Ramblings, Uncategorized

The Battlefield of the Data Engineer.

I want to interrupt your semi-regularly scheduled technical blog post for this public service announcement. I mean the URL does say “confessions” does it not? For better or worse I’ve been thinking a lot lately about what it means to be a Data Engineer, what it’s like to be a Data Engineer, and what makes a […]

November 2, 2020

Big Data, Data, Data Engineering, Python

Intro to Apache Kafka for Data Engineers

Streams, streams, streams…. when will it ever end? It’s hard to keep up with all the messaging systems these days. GCP PubSub, AWS SQS, RabbitMQ, blah blah. Of course there is Kafka, hard to miss that name floating around in the interwebs. Since pretty much every system designed these days is a conglomerate of services… […]

October 26, 2020

Data, Data Engineering, Data Warehousing, SQL

Apparently Apache Hive is still a thing…. I should probably learn it.

So what’s up with Apache Hive? It’s been around a long time… but all of a sudden it seems like it’s a requirement in every other job posting these days. “It’s not you… it’s me.” That’s what I would tell Hive if it suddenly materialized as Mr. Smith via the Matrix that I’m pretty sure is the new […]

October 19, 2020

Data, Data Engineering, Geospatial, Python, SQL

Geospatial Data with Apache Spark – Intro to RasterFrames

Yikes, distributed geospatial data processing at scale. That has fun written all over it… not. There aren’t that many people doing it, so StackOverflow isn’t that useful. Anyone who has been around geospatial data knows the tools like GDAL are notoriously hard to use and buggy… and that one’s probably the “best.” What to do […]

October 12, 2020

Data, Data Engineering, Data Warehousing, Python, SQL

PySpark SQLContext….tired of your decades old ETL process?

Seriously. Haven’t you had enough of SSIS, SAP Data Services, Informatica, blah blah blah? It’s been the same old ETL process for the last 20 years. CSV files appear somewhere, some poor, aged, and angry developer soul in a cubicle pulls up the same old GUI ETL tool, maps a bunch of columns to […]

September 25, 2020

