Recently, I was working on a little learning around DuckDB and AWS Lambda, which included some work with S3. It had been some time since I had tried working with files in S3, and it was kinda clunky the last time I tried it; whether that was DuckDB’s fault or mine, I was unsure.

If you go to Google and read about CSV files in S3 and DuckDB, it seems that in the past folks had to do some gyrations with either boto3 or httpfs to get the job done. That is very annoying and clunky.

So I had the chance to revisit S3 and DuckDB with good ole’ CSV files, and it was a much nicer experience. First, of course, you must have AWS credentials on the system somewhere, either in the ~/.aws directory or in environment variables.

Something like this will do.
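As a rough sketch of what "credentials on the system" means, the Python below checks the two default spots: the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and the shared `~/.aws/credentials` file. The helper name `find_aws_credentials` is made up for illustration, and it only checks that something is present, not that the keys are valid.

```python
import os
from pathlib import Path


def find_aws_credentials(env=None, home=None):
    """Return "env", "file", or None depending on where credentials appear."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Environment variables: the standard names read by the AWS SDKs.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "env"

    # Shared credentials file, normally written by `aws configure`.
    if (home / ".aws" / "credentials").is_file():
        return "file"

    return None


if __name__ == "__main__":
    print(find_aws_credentials())
```

If this prints `None`, DuckDB will most likely not find anything either, so it is a cheap sanity check before blaming the Duck.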

Next, you should instruct DuckDB to set up the AWS secrets, going through the normal rotation looking for credentials in the default spots as mentioned above.
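A minimal sketch of that step from Python, assuming the `duckdb` package is installed. It uses DuckDB's `CREATE SECRET` statement with the `credential_chain` provider, which walks the default AWS credential lookup. The secret name `aws_default` and the function name are placeholders of mine, and depending on your DuckDB version you may need to install and load the `aws` and `httpfs` extensions first.

```python
# SQL for a DuckDB S3 secret that resolves credentials through the
# AWS default provider chain (environment, ~/.aws files, and so on).
# The secret name "aws_default" is an arbitrary placeholder.
CREATE_SECRET_SQL = """
CREATE OR REPLACE SECRET aws_default (
    TYPE s3,
    PROVIDER credential_chain
);
"""


def connect_with_s3_secret():
    """Open an in-memory DuckDB connection and register the S3 secret."""
    import duckdb  # imported lazily; requires `pip install duckdb`

    con = duckdb.connect()
    con.execute(CREATE_SECRET_SQL)
    return con
```

Calling `connect_with_s3_secret()` hands back a connection that should be able to resolve `s3://` paths with whatever credentials the chain finds.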

Once this is done, querying a CSV file in S3 is as simple as it should be.
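Something along these lines, as a sketch: DuckDB's `read_csv` table function accepts an `s3://` path directly, so the query looks just like one against a local file. The helper names and the example bucket path are hypothetical, and `con` is assumed to be a DuckDB connection that already has S3 credentials configured.

```python
def csv_query(s3_uri, limit=10):
    """Build a query that reads a CSV file straight from S3."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    # Double any single quotes so the path is a safe SQL string literal.
    escaped = s3_uri.replace("'", "''")
    return f"SELECT * FROM read_csv('{escaped}') LIMIT {int(limit)}"


def preview_csv(con, s3_uri, limit=10):
    """Run the query on an existing DuckDB connection and return the rows."""
    return con.execute(csv_query(s3_uri, limit)).fetchall()
```

For example, `preview_csv(con, "s3://my-bucket/data.csv")` would pull back the first ten rows, with no boto3 download step in between.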

It’s nice to see DuckDB has first-class support for querying files in S3, since cloud storage has become such a regular part of our lives. Any tool worth its weight in salt needs to have this sort of blind, easy-to-use file access that looks no different from operations on local files.

No more boto3 to load files into memory, etc., etc. That stuff needs to be abstracted away.

 

In my never-ending quest to plumb the most boring depths of every single data tool on the market, I found myself annoyed when recently using DuckDB for a benchmark that was reading Parquet files from S3. What was not clear, or easy, was trying to figure out how DuckDB would LIKE to read default AWS credentials.

Read more

I am a glutton for punishment, a harbinger of tidings, a storm crow, a prophet of the data land; my sole purpose is to plumb the depths of the tools we use every day in Data Engineering. I find the good, the bad, the ugly, and splay them out before you, string ’em up and quarter them.

Today, for the third time, we put that ole’ Duck to the test. I want to test to see if DuckDB has fixed their OOM (Out Of Memory) errors on commodity hardware … that age-old problem of “larger than memory data sets.”

Read more

There are some things you don’t need until you need them. I ran into that situation recently with needing to process some CSV / flat files on short notice. At first, it appeared to be easy, but then I realized, as usual, there was a little monkey wrench thrown into the middle of it.

It is nothing earth-shattering, it’s just something that comes up so rarely that I forget there are ways to deal with these inconveniences without jumping through unnecessary hoops.

Read more

Polars is the hot new Rust-based Python Dataframe tool that is taking over the world and destroying Pandas even as we speak. You want the quick and dirty introduction to Polars? Look no further.

I’ve had something rattling around in the old noggin for a while; it’s just another strange idea that I can’t quite shake out. We all keep hearing about Arrow this and Arrow that … every new tool built today for Data Engineering seems to be at least partly based on Arrow’s in-memory format.

So, today we are going to do an experiment.

What if instead of writing a Data Pipeline in Polars, or another tool … that uses Arrow under the hood … what if we actually write a data pipeline with Arrow?

Read more

One of the things I love about Python is its flexibility and huge community, a community that puts out a never-ending stream of useful packages for the average Software Engineer. In a show of solidarity to the open-source community, I thought I would publish a PyPI package that will probably be used by 5 people around the world.

Read more

Nothing will raise the hackles on the backs of hairy and pale programmers who’ve been stuck in their mom’s basement for a decade like bringing up OOP (Object Oriented Programming), especially in the context of Python. It’s like two fattened calves prepared for slaughter. Sharpen your knives and take your place; it’s time to feast upon the boiling cauldron of emotions simmering away in the interwebs.

Read more