September 2021 - Confessions of a Data Guy

gRPC for Data Engineers

If you’ve been around Data Engineering for a while, like me, you’ve noticed a few trends in the industry at large, and in individual data engineers themselves. There seem to be a few types of data engineers, and which type you are depends on where you’ve worked and what your projects have looked like. Some data engineers focus on general ETL, Data Warehousing, and such things. They move data around and transform it using a myriad of tools. The other set of data engineers is more focused on infrastructure at a low level; they provide the underlying tools and services others use to move that data around.

Which are you? One topic you may or may not be familiar with, depending on your background, is RPC, or more specifically gRPC. What is it?

The Wild West of Parallel Computing – Review of Bodo.ai

It truly is the Wild West of parallel computing these days. It seems that big data has brought out an onslaught of companies trying either to make it easier to use any number of existing big data platforms or to build their own. Most of them take shots at tools like Spark and Dask, probably two of the more well-known big data engines. Of course, with Python’s rise, especially in Data Science and ML, many of these tools target that audience.

One such newcomer is Bodo.ai, and I’ve seen them pop up on places like r/dataengineering. Fortunately, they have a free community edition, so let’s kick the tires and see what’s going on.

Dask vs PySpark – Performance and Other Thoughts.

Every once in a while I see someone talking about their wonderful distributed cluster of Dask machines, and my curiosity gets aroused. I know plenty of people use Dask, mostly on their local machines, but it seems that with the meteoric rise of Spark, especially with tools like EMR and Databricks, Dask is slowly slipping into the shadows. I’ve had bad experiences with Dask in the past, trying to get it to work well in production. I suppose that comes from working with tried-and-true Spark and other bulletproof distributed systems. I’ve been meaning to return to Dask for a while, to compare a similar Dask and Spark cluster on performance … and other things like ease of setup and writing code. Let’s get to it.
