Python Archives - Page 2 of 11 - Confessions of a Data Guy

PyArrow vs Polars (vs DuckDB) for Data Pipelines.

I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … it seems every new tool built today for Data Engineering is at least partly based on Arrow’s in-memory format.

So, today we are going to do an experiment.

What if instead of writing a Data Pipeline in Polars, or another tool … that uses Arrow under the hood … what if we actually write a data pipeline with Arrow?

July 25, 2024

Data Engineering, Python

Building Open-Source Python Packages – SparklePop

One of the things I love about Python is its flexibility and huge community, a community that puts out a never-ending stream of useful packages for the average Software Engineer. In a show of solidarity with the open-source community, I thought I would publish a PyPI package that will probably be used by 5 people around the world.

June 12, 2024

Data, Data Engineering, Python

Is Python OOP the Devil? Or Savior?

Nothing will raise the hackles on the backs of hairy and pale programmers who’ve been stuck in their mom’s basement for a decade like bringing up OOP (Object Oriented Programming), especially in the context of Python. It’s like two fattened calves prepared for slaughter, sharpen your knives, and take your place, it’s time to feast upon the boiling cauldron of emotions simmering away in the interwebs.

June 3, 2024

Python

Why You Should Replace Pandas with Polars

I’m still amazed to this day how many folks hold onto stuff they love; they just can’t let it go. I get it, sorta, I’m the same way. There are reasons why people do the things they do, even if they are hard for us to understand. It blows my mind when I see on r/dataengineering that people are still using SSIS for ETL to this day.

I guess it says something about a piece of technology, whatever it is, when it refuses to roll over and die. There’s something to that software.

Yet, even though we may take off our hats and shake the hand of Pandas, the overlord of all Python Data Folk, that still doesn’t mean we can’t move on. Show respect where respect is due, but push those able to move on, to move on into the future. When the future arrives you should consider making the switch.

Replacing Pandas with Polars.

In case you need a more detailed look into how you can actually replace Pandas with Polars, read more here. Today is really just a philosophical bunch of thoughts: the reasons why people should move on from Pandas to Polars, why they don’t, and how to overcome those things.

Why replace?

Let’s just list some reasons.

Pandas is slow
Pandas can’t work on larger-than-memory (OOM) datasets
Pandas can be cumbersome
Pandas doesn’t have a SQL interface
Polars has a SQL interface
Polars is based on Rust (the newest and coolest thing)
Polars can do what Pandas can do, better.
Other tools and general features will be more focused on Polars in the future
At some point, we have to accept that new things are simply better than the old thing

I mean we could go on forever.

What are the reasons people will fail to make the switch from Pandas to Polars (some of these are symptoms of other problems)?

The claim that Pandas is too entrenched in the codebase
Unable to dedicate the time to switch
Afraid of the unfamiliar
A culture that is unable to learn new tools
The claim that some minor “thing” they do with Pandas can’t be done in Polars (missing the forest for the trees)

I think it’s more of a cultural thing when it comes to folk who don’t want to switch from Pandas to Polars. It’s probably a business that doesn’t like change, doesn’t deal with tech debt, doesn’t focus enough on Engineering improvements, where the status quo is never challenged.

I’m not saying that there aren’t serious hurdles when replacing pieces of technology in any stack, but that doesn’t make it a non-starter. Most good things for us are hard, in life and Engineering. We want stability in our stack, yes, but we don’t want to be the people who are 15 years behind the changes either; we have to strike a balance.

Sometimes things come along like Polars, or Spark, or Snowflake, whatever … we can tell they are here to stay after a year or two. It’s clear the Data Engineering community and Platforms are moving in a direction, so why not move with it? No reason to stay behind and languish.

May 15, 2024

Data, Data Engineering, Python, Rust

Reading and Processing JSON with Rust vs Python.

Have you ever wondered about being explicit in your code vs being vague? I think about this a lot as I’m writing code on a daily basis. I’ve found I like being explicit and verbose when writing code, rather than being vague in what I’m doing most of the time.

May 2, 2024

Python, Ramblings

Google Fires Python. What Next?

What is going on? Is the world coming to an end? I thought Python was going to live forever. Well, apparently not at Google. Recently Google announced it was laying off its entire North American-based Python team that was supporting Google’s special needs with Python, in favor of cheaper offshore workers.

Apparently, some of these Engineers were GOAT-level employees.

Is that really that much of a surprise? Probably not. Ever since Bard, Google has been failing hard. They still make a ton of money, but old Google has gone bye-bye. It’s a new Era of Corporate Google, and it’s here to stay.

Heck, you can still make bank working in Software at Google … but you better save some of those greens for when your name gets called.

April 30, 2024

Data, Data Engineering, Python

If:Else Logic and Complexity – Hiding the Pea.

I was recently confronted with an interesting conundrum when writing a complex data pipeline. It arose from my quest to reduce complexity in one part of the design, only to find that complexity creeping into another part, reinforcing the classic question of whether you can really make the complexity pea go away, or if you simply shuffle the pea somewhere else to hide it.

April 28, 2024

Data Engineering, Python

How to JOIN datasets in Polars … compared to Pandas.

It’s been a while since I wrote about Polars on this blog; I’ve been remiss. Some time ago I wrote a very simple comparison of switching from Pandas to Polars. I didn’t put much real effort into it, yet it was popular, so this is my attempt at expanding on that topic a little.

Recently, while lying flat on my back on my sunporch soaking up the vitamin D beating down on me, dreaming about code, which I always do, it struck me.

April 7, 2024

Data, Data Engineering, Python, SQL

Building Databricks Data Pipelines 101

Have you ever wondered at a high level what it’s like to build production-level data pipelines on Databricks? What does it look like, what tools do you use?

March 29, 2024

Data, Data Engineering, Python

How To Build an Open Source PyPI Python Package

Ever wondered how to build an end-to-end project for an Open Source Python Package that gets published to PyPI? I built out lakescum, an open-source package to help with querying Databricks Unity Catalog Delta Lake tables with Polars, DuckDB, or PyArrow. https://github.com/danielbeach/lakescum

March 25, 2024
