Recently, I was working on a little learning project around DuckDB and AWS Lambda, which included some work with S3. It had been some time since I had tried working with files in S3, and it was kinda clunky the last time I tried; whether that was DuckDB’s fault or mine, I was unsure.

It seems that when you go to Google and read about CSV files in S3 and DuckDB, in the past folks had to do some gyrations with either boto3 or manual httpfs configuration to get the job done. Very annoying and clunky.

So I had the chance to revisit S3 and DuckDB with good ole’ CSV files, and it was a much nicer experience. First, of course, you must have AWS credentials on the system somewhere, either in ~/.aws or the ENV.

Something like this will do.
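The classic ~/.aws/credentials file, with placeholder values of course:

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

(Your region typically lives in ~/.aws/config, or in the AWS_DEFAULT_REGION environment variable.)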

Next, you should instruct DuckDB to set up the AWS secret, going through the normal rotation of looking for credentials in the default spots mentioned above.
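Here’s a rough sketch of that in Python. The CREDENTIAL_CHAIN provider tells DuckDB to walk the standard AWS chain: ENV variables, the ~/.aws files, instance profiles, and so on.

```python
import duckdb

con = duckdb.connect()

# httpfs provides the s3:// filesystem; aws provides CREDENTIAL_CHAIN.
# (Recent DuckDB versions autoload these, but installing is harmless.)
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("INSTALL aws;")
con.execute("LOAD aws;")

# Let DuckDB hunt for credentials in the default spots,
# just like boto3 and the AWS CLI do.
con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")
```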

Once this is done, querying a CSV file in S3 is as simple as it should be.
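Something like this, continuing with the connection from above (the bucket and file names are hypothetical):

```python
# S3 reads now look no different from local file reads.
# (.df() hands the result back as a pandas DataFrame.)
df = con.execute(
    "SELECT * FROM read_csv_auto('s3://my-hypothetical-bucket/data.csv')"
).df()
print(df.head())
```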

It’s nice to see DuckDB has first-class support for querying files in S3, since cloud storage has become such a regular part of our lives. Any tool worth its salt needs this sort of blind, easy-to-use file access that looks no different from operations on local files.

No more boto3 to load files into memory, etc., etc. That stuff needs to be abstracted away.


Today we talk about what is really going on with Data Contracts. They came in like a rocket a few years ago, but then died on the vine. What’s the deal?

Well, everyone is abuzz with the recently announced S3 Tables that came out of AWS re:Invent this year. I’m going to call fool’s gold on this one right out of the gate. I tried them out, in real life that is, not just some marketing buzz, and they will leave most people, not all, but most, disappointed.

Surprise, surprise.

I wrote a more in-depth article here about the background and infighting between Databricks/Snowflake/AWS and the Lake House Storage Format wars. If you have time, read that, but today, here, I just want to show you technically how to use S3 Tables in code.

Call it a technical introduction to S3 Tables.
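To give you a taste, here’s a rough sketch, assuming a recent boto3 release that ships the s3tables client; the bucket and namespace names are made up:

```python
import boto3

# Table buckets are the new top-level container that S3 Tables live in.
s3tables = boto3.client("s3tables", region_name="us-east-1")

bucket = s3tables.create_table_bucket(name="my-table-bucket")
print(bucket["arn"])

# Namespaces group tables inside a table bucket, like a schema.
s3tables.create_namespace(
    tableBucketARN=bucket["arn"],
    namespace=["my_namespace"],
)
```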

Read more

In my never-ending quest to plumb the most boring depths of every single data tool on the market, I found myself annoyed when recently using DuckDB for a benchmark that was reading Parquet files from S3. What was not clear, or easy, was figuring out how DuckDB would LIKE to read default AWS credentials.

Read more

I am a glutton for punishment, a harbinger of tidings, a storm crow, a prophet of the data land, my sole purpose is to plumb the depths of the tools we use every day in Data Engineering. I find the good, the bad, the ugly, and splay them out before you, string ’em up and quarter them.

Today, for the third time, we put that ole’ Duck to the test. I want to see if DuckDB has fixed its OOM (Out Of Memory) errors on commodity hardware … that age-old problem of “larger than memory” data sets.
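One way to poke at this (not necessarily how the full test does it) is to cap DuckDB’s memory below the size of the data and see whether it spills to disk or just falls over. A minimal sketch, with made-up paths and sizes:

```python
import duckdb

con = duckdb.connect("test.db")

# Cap memory well below the dataset size to force larger-than-memory behavior.
con.execute("SET memory_limit = '2GB';")
# Give DuckDB somewhere on disk to spill when it blows past that cap.
con.execute("SET temp_directory = '/tmp/duckdb_spill';")

# A hypothetical pile of Parquet files much larger than 2GB.
result = con.execute("""
    SELECT some_key, COUNT(*) AS cnt, AVG(some_value) AS avg_val
    FROM read_parquet('/data/big/*.parquet')
    GROUP BY some_key
""").df()
```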

Read more

There are some things you don’t need until you need them. I ran into that situation recently, needing to process some CSV/flat files on short notice. At first it appeared to be easy, but then I realized, as usual, there was a little monkey wrench thrown into the middle of it.

It is nothing earth-shattering; it’s just something that comes up so rarely that I forget there are ways to deal with these inconveniences without jumping through unnecessary hoops.

Read more

Is there anything worse than the PR (Pull Request) process at most companies? Probably not. It’s the dreaded 600-pound gorilla in the room that no one wants to talk about. Everyone hates it, and everyone has to do it. But it doesn’t have to be like that.

There are a few tried-and-true ways to make the perfect PR that takes all your problems away. Check out the video for more.


This is an interesting one indeed, one that teases and puzzles the brain to no end. Has the Data Warehouse finally died? Has that unruly upstart, the Lake House, finally taken its place atop the seething mass of data we call home? Can we say that, after all these decades, The Data Warehouse Toolkit and Kimball have finally gone the way of the dinosaurs? Maybe. Probably. I don’t know.

Read more

I’ve been hacking around with tools and programming since Perl was a thing. I’ve run the gamut of Data Platforms, from large organizations to tiny startups and everything in between. I’ve worked on Data Platforms that dropped ungodly amounts of money on SAP products, and at places where we built our own massive data processing platforms on Kubernetes.

To each their own, I guess.

Read more

I recently wrote on my Substack (Data Engineering Central) about how I used the new OpenAI o1 model to do some basic Data Engineering tasks around PostgreSQL. It did OK. I’ve also been using Copilot and ChatGPT for over a year now to assist with the daily code I have to write for one reason or another.

Read more