Data Archives - Page 18 of 22 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Data Warehousing, Python

Intro to Apache Cassandra for Data Engineers

Hmm… yet another distributed database… will it ever end? Probably not. It’s hard to keep up with them all, even the old ones. That brings me to Apache Cassandra. Of all the popular big data distributed databases, Cassandra seems to be kind of like that student who always sits in the back row and never says anything… you forget they are there… until someone says their name… Apache Cassandra. I honestly didn’t even know what space Cassandra fit in before trying to install and use it… so this should be fun. What Is Cassandra? Distributed NoSQL.

December 10, 2020

Data, Data Engineering, Data Warehousing, SQL

Database/SQL Fundamentals for Data Engineers

I’ve met my fair share of snooty people who poo-poo SQL and databases as second-class hand-me-downs. I still remember talking to an academic computer science grad who was explaining to me how he refused to teach database classes; he was just too good for that. Whatever. Apparently accepting how 90% of companies are able to operate as data-driven businesses just isn’t important to some people. There is probably nothing more important in the tool belt of a data engineer than being above average at SQL and databases. Tuning queries, writing queries, indexing, designing data warehouses. I’m sure there are some Hadoop data engineers who skipped this step of the RDBMS world, but that is not the normal path of a data engineer. Let’s dive into the fundamentals of SQL and databases.

December 4, 2020

Data, Data Engineering, Ramblings, Scala

Scala with Text Files and ElasticSearch

I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in Python, so all the snotty people who complain about Python not being fast enough or whatever can go hang out with this cow, looks like he could use a friend. This is something I’ve been meaning to do for a while. Use Scala to read some text file(s), and store the data somewhere with some client. I chose ElasticSearch. I really just wanted practice doing something simple like reading files, and I was curious about how good the Scala clients are for popular tools.

November 20, 2020

Big Data, Data, Data Engineering, Python, Uncategorized

Intro to Apache Beam for Data Engineers

What is this thing? What’s it good for? Who’s using it and why? That’s pretty much what I ask myself once a month when I actually see the name Apache Beam pop up in some feed I’m scrolling through. I figured it has to be legit to be Apache incubated, but I’ve never run across anyone in the wild using it yet. On the surface it appears to be semi-pointless since it runs on top of other distributed systems like Spark, but I’m sure there is more to it. Today, I’m going to run through an overview of Apache Beam and then try installing and running some data through it, kick the tires as it were, and see if my mind changes about the pointless bit.

November 17, 2020

Data, Data Engineering, Python, SQL

Pandas DataFrame.to_sql() …. { how you should configure it to not be that guy. }

Sometimes Pandas is slow like this…. until you tweak it.

I never understand it when someone comes up with a great tool, then defaults it to work poorly… leaving the rest up to imagination. The Pandas dataframe has a great and underutilized tool… to_sql(). Lesson learned, always read the fine print I guess. I’m usually guilty of this myself… wondering why something is slow and sucks… and not taking time to read the documentation. Here are some musings on using to_sql() in Pandas and how you should configure it to not pull your hair out.
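The teaser doesn’t say which settings the post ends up recommending, so treat the following as a minimal sketch and not the author’s configuration. It assumes the usual culprit is row-by-row inserts and shows two real to_sql() parameters that are commonly tuned: chunksize (rows written per batch) and method="multi" (several rows per INSERT statement). The table name, the sample data, and the in-memory SQLite target are all made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

# In-memory SQLite keeps the sketch self-contained; a real target would
# more likely be a SQLAlchemy engine pointed at Postgres, SQL Server, etc.
conn = sqlite3.connect(":memory:")

df.to_sql(
    "measurements",        # hypothetical table name
    conn,
    if_exists="replace",   # drop and recreate the table if it already exists
    index=False,           # don't write the DataFrame index as a column
    chunksize=100,         # write 100 rows per batch instead of all at once
    method="multi",        # pack many rows into each INSERT statement
)

# Confirm that every row landed in the table.
row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(row_count)
```

Whether method="multi" actually helps depends on the database and driver, so it is worth timing both ways. The chunk size is kept small here because some backends cap the number of bound parameters per statement.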

November 11, 2020

Big Data, Data, Data Engineering, Python

Intro to Apache Kafka for Data Engineers

Streams, streams, streams… when will it ever end? It’s hard to keep up with all the messaging systems these days. GCP PubSub, AWS SQS, RabbitMQ, blah blah. Of course there is Kafka, hard to miss that name floating around in the interwebs. Since pretty much every system designed these days is a conglomerate of services… it’s probably a good idea to poke at things under the covers. Of course Apache Kafka is probably at the top of the list of those open source streaming services. Today I’m going to attempt to install a Kafka cluster and push some messages around.

October 26, 2020

Data, Data Engineering, Data Warehousing, SQL

Apparently Apache Hive is still a thing…. I should probably learn it.

So what’s up with Apache Hive? It’s been around a long time… but all of a sudden it seems like it’s a requirement in every other job posting these days. “It’s not you… it’s me.” That’s what I would tell Hive if it suddenly materialized as Mr. Smith via the Matrix that I’m pretty sure is the new reality these days. I’ve been around Hadoop and Spark for a while now and I feel like Hive is that weird 2nd cousin who shows up at Thanksgiving. You know you should like him and be nice to him, but you’re not sure why. It seems like Hive sits in a strange world. It’s not an RDBMS, although it does ACID, but it’s touted as a Data Warehousing tool. Time to dig in.

October 19, 2020

Data, Data Engineering, Geospatial, Python, SQL

Geospatial Data with Apache Spark – Intro to RasterFrames

Apache Spark and RasterFrames, the big data geospatial processing juggernaut.

Yikes, distributed geospatial data processing at scale. That has fun written all over it… not. There aren’t that many people doing it, so StackOverflow isn’t that useful. Anyone who has been around geospatial data knows the tools like GDAL are notoriously hard to use and buggy… and that one’s probably the “best.” What to do when you want to process and explore large satellite datasets like Landsat and MODIS? Terabytes/petabytes of data, what are you going to do, download it? The power of distributed processing with Apache Spark. The simplicity of using SQL to work on geospatial data. Put them together… RasterFrames. What a beast.

October 12, 2020

Data, Data Engineering, Data Warehousing, Python, SQL

PySpark SQLContext….tired of your decades old ETL process?

Seriously. Haven’t you had enough of SSIS, SAP Data Services, Informatica, blah blah blah? It’s been the same old ETL process for the last 20 years. CSV files appear somewhere, some poor, aged, and angry Developer soul in a cubicle pulls up the same old GUI ETL tool, maps a bunch of columns to some SQL Server, or, if you’re in a forward-thinking shop… maybe Postgres. This is after painstakingly designing the Data Warehouse with good ole’ Kimball in mind. Data flows from some staging table to some facts and dimensions. Eventually some SQL queries are run and a Data Mart is produced summarizing a year’s worth of data for a crabby Sales or Product department. Brings a tear to my eye. And this is all because Apache Spark sounds scary to some people?

September 25, 2020

Data, Data Engineering

Create Your Very Own Apache Spark/Hadoop Cluster….then do something with it?

I’ve never seen so many posts about Apache Spark before, not sure if it’s 3.0, or because the world is burning down. I’ve written about Spark a few times, even 2 years ago, but it still seems to be steadily increasing in popularity, albeit still missing from many companies’ tech stacks. With the continued rise of AWS Glue and GCP Dataproc, running Spark scripts without managing a cluster has never been easier. Granted, most people never work on datasets large enough to warrant the use of Spark… and Pandas works fine for them. Also, very annoyingly, it seems most videos/posts on Spark are either about shuffling/joins blah blah that make no sense to someone who doesn’t use Spark on a daily basis, or they are so “Hello World” as to be useless in the real world. Let’s solve that problem by setting up our own Spark/Hadoop cluster and doing some “real” data things with it.

September 21, 2020
