Python Archives - Page 4 of 11 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Python

Replacing Pandas with Polars. A Practical Guide.

I remember those days, oh so long ago, it seems like another lifetime. I haven’t used Pandas in many a year, decades, or whatever. We’ve all been there, done that. Pandas I mean. I would dare say it’s a rite of passage for most data folk. For those using Python, it’s probably one of the first packages you use other than say … requests?

You know, Pandas feels like Airflow, everyone keeps talking about its demise, but there it is everywhere … used by everyone. Sure it’s old, wrinkled, annoying, slow, and obtuse, but it’s ours, and that makes it the words of Gollum … precious.

We should probably get to the point already. Everyone is talking about Polars. Polars is supposed to replace Pandas. Will it? Maybe 10 years from now. You can’t untangle Pandas from everywhere it exists overnight. Do you still want to replace Pandas with Polars and be one of the cool kids? Ok. Let’s take a look at a practical guide to replacing Pandas with Polars, comparing functionally used by most people. My code is available on GitHub.

January 19, 2023

Big Data, Data, Data Engineering, Python

What is Apache Arrow? Asking for a friend.

We’ve all been in that spot, especially in tech. You wanted to fit in, be cool, and look smart, so you didn’t ask any questions. And now it’s too late. You’re stuck. Now you simply can’t ask … you’re too afraid. I get it. Apache Arrow is probably one of those things. It keeps popping up here and there and everywhere.

The only reason I know anything about Arrow is that some years ago, circa 2019 and earlier I stumbled into Arrow and used it to read and write Parquet files (pyarrow that is). Heck, I even used it to tie together Python and Hadoop, Lord knows what I was thinking back then. I’m amazed at how much I used PyArrow back in the day, even to compare Parquet vs Avro.

“Back then it seems like no one used Arrow much, no one was writing about it, using it, or talking about it. At least not that I saw. But oh how times have changed. Arrow seems to be showing up everywhere and is starting to become a backbone for many other tools.”
– me

December 27, 2022

Big Data, Data, Data Engineering, Python, Ramblings, Rust

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

There once was a day when no one used DataFrames that much. Back before Spark had really gone mainstream, Data Scientists were still plinking around with Pandas a lot. My My, what would your mother say? How things have changed. Now everyone wants a piece of the DataFrame pie. I mean it tastes so good, doesn’t it?

Would anyone like a nice big slice of groupBy, maybe agg is what you need? No? Can you say distributed data set? Whatever it is you’re looking for, I’m quite sure a nice old DataFrame can give it to you. With so many options to choose from … what do you choose? I don’t know, whatever works best for you. But, it does set the stage nicely for a clash of the titans per see.

Let’s do this just that. Straight out of the box performance test. Bunch of CSV’s, a little aggregation, just some simple stuff. Mirror mirror on the wall, who is the fastest with DataFrames of them all?

December 10, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

New to PySpark? Do this, not that.

Do this, not that. Well, I’ve got my own list. With everyone jumping on the PySpark / Databricks / EMR / Glue / Whatever bandwagon I thought it was long overdue for a post on what to do, and not to do when working with Spark / PySpark. I take the pragmatic approach to working with Spark, it’s honestly very forgiving well and far into the 10s of TBs of data. Once you wander past that point things tend to get a little spicy if you don’t have it all dialed in. As with most things in life if you get a few things right, and of course don’t do some things, that will get you a long way, the same applies to Spark.

October 13, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Soda-Core. Data Quality at Scale.

Ever since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools, not enough options, hard to use, and the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…

“Data reliability testing for SQL- and Spark- accesssible data.“
soda-core docs

Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics were hear a lot about these days. The reality often doesn’t meet the expectations most of the time. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core pair with Spark can show us some real promise. Code available on GitHub.

September 14, 2022

Data, Data Engineering, Golang, Python, Rust

Working with Cloud Storage (s3). Golang vs Rust vs Python. Who shall emerge victorious?

I’ve always been a firm believer in using the right tool for the job. Sometimes I look at a piece of code … and ask … why? I mean just because you can do something doesn’t mean that you should. I see a lot of my job as someone who writes code … as not just my ability to write code, but the ability to reason about problems and design simple and elegant solutions that solve the problem at hand.

I try not to let my love of a tool, language, or package color my view of the world as it is. In fact, there is wisdom to be found in being critical of those languages and tools you love the most. Be aware of their shortcomings and failures. This leads to better software and architecture designs, and less complexity. Too often I’ve seen folks picking their tool of choice and then sticking with it till the bitter end, and it usually is bitter. There is more to life than writing obtuse Scala code that is illegible for some mundane task.

This sort of thing is a blight on everyone and every system. Now I must descend from my high horse and join the peasants on the dusty road of life. Today I want to look at some very common Data Engineering tasks, namely cloud storage, and what it is like to do such a thing with Golang, Rust, and Python. I will let you draw your own conclusions. Maybe. Code available on GitHub.

September 10, 2022

Big Data, Data, Data Engineering, Python

A Few Wonderful PySpark Features.

Just when I think it cannot get more popular, it does. I have to admit, PySpark is probably the best thing that ever happened to Big Data. It made what was once a myth, approachable to the average person. No need for esoteric Java skills, no more MapReduce, just plain old Python. Another amazing thing about Spark in general, and by extension PySpark, is the sheer amount of out-of-the-box capabilities. I wanted to dedicate this post to a few amazing and wonderful features of PySpark that make Data Engineering fun and powerful.

June 24, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

It still seems like the wild west of Data Quality these days. Tools like Apache Deque are just too much for most folks, and Data Quality is still new enough to the scene as a serious thought topic that most tools haven’t matured that much, and companies dropping money on some tool is still a little suspect. I’ve probably heard more about Great Expectations as a DQ tool than most.

With the popularity of PySpark as a Big Data tool, and Great Expectations coming into its own, I’ve been meaning to dive into what it would actually look like to to use Great Expectations at scale and answer some simple questions. How easy is it to get up and running with Spark, what’s the path of least resistance to getting some basic Data Quality checks in place in a data pipeline.

June 2, 2022

Big Data, Data, Data Engineering, Python

Data Engineering/Data Pipeline repo Project Template (free).

Probably one of the hardest hurdles to jump over when starting out in anything new, including Data Engineering and Data Pipelines, is knowing where to start. It always can be a little daunting. One aspect that can make or break any project, giving you the confidence to move forward like Sparticus to conquer, is having a good project template for your repository of code and logic that will encapsulate and present your code to others.

I’ve created a free and hopefully helpful Python blank GitHub project template that you can clone, change, and steal to your heart’s desire. I hope it will be helpful and set you going in the right direction for your next project.

May 8, 2022

Big Data, Data, Data Engineering, Python, Ramblings

Review of Prefect for Data Engineers

Not going to lie, I do enjoy the vendor wars that this marketing craze called “The Modern Data Stack” has created. I like to keep just about everything in life at arm’s length. Kinda like the way you look at your crazy third cousin out of the corner of your eye at the family reunion. I mean it’s nice to have all these options to choose from these days when building data pipelines.

One tool I haven’t been able to poke the tires on yet is Prefect. It appears to be another data orchestration tool for Python, but we shall find out. I want this to be an introduction to Prefect, we shall just try it out and let the chips fall where they may.

April 30, 2022

Replacing Pandas with Polars. A Practical Guide.

What is Apache Arrow? Asking for a friend.

Dataframe Showdown – Polars vs Spark vs Pandas vs DataFusion. Guess who wins?

New to PySpark? Do this, not that.

Soda-Core. Data Quality at Scale.

Working with Cloud Storage (s3). Golang vs Rust vs Python. Who shall emerge victorious?

A Few Wonderful PySpark Features.

Great Expectations with Databricks and Apache Spark. A Tale of Data Quality.

Data Engineering/Data Pipeline repo Project Template (free).

Review of Prefect for Data Engineers

Interesting links

Pages

Categories

Archive