I was wondering the other day … since Polars now has a SQL context and is getting more popular by the day, do I need DuckDB anymore? These two tools are hot. Very hot. I haven't seen hype like this since Databricks and Snowflake first came out and started throwing mud at each other.
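
Just to show how much these two overlap, here's a minimal sketch (the sales frame and its columns are made up) running the same aggregation through Polars' SQL context and through DuckDB, which can happily scan a Polars frame sitting in scope:

```python
import duckdb
import polars as pl

# hypothetical toy frame, just to show the overlap
df = pl.DataFrame({"city": ["NYC", "LA", "NYC"], "sales": [100, 200, 150]})

# Polars' built-in SQL context, registering the frame as a table
ctx = pl.SQLContext(frames={"sales": df})
print(ctx.execute("SELECT city, SUM(sales) AS total FROM sales GROUP BY city", eager=True))

# the same query through DuckDB, which picks up the Polars frame by variable name
print(duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city").pl())
```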

You might think it doesn't matter. Six of one, half a dozen of the other, whatever. But I think about these things. Simplicity is underrated these days. If you have two tools but could do it with one, should you use two? Probably depends on the Engineering culture you're working in.

I mean, just because you can doesn't mean you should. Some data engineering repo with 50 different Python pip packages installed, constantly breaking and upgrading for no reason. CI/CD builds failing, conflicts. Frustration. Why? Just because someone wanted to do one thing and decided they needed yet another package to do it.

Read more


So, you've heard about dbt, have you? I honestly can't decide if it's here to stay or not. It probably is; enough folks are using it and preaching about it. I personally have always been a little skeptical of dbt, not because it can't do what it says it can do (it can), but because I'm old and bitter from my many years of Data Engineering, and I always see the problems in things.

But, I will let you judge that for yourself. Today I want to give a brief overview of dbt, kick the tires, muse about its features, and most importantly, look at dbt from a Data Engineering perspective to ferret out the good, the bad, and the ugly. I will try my best to be nice, but don't count on it. Code is on GitHub.

Read more

I mean, I'm sort of being facetious and sort of not. There is some truth that rings out in those words. I'm sure someone selling Data Observability tools, or writing Great Expectations all day, will not like the idea of relying on only 2 data validations. But honestly, these two are more than what probably 80% of Data Teams are using today for validation, which is none. Which 2, you ask? Glad you asked.

Read more

Sometimes I get to feeling nostalgic for the good ol’ days. What days am I talking about? My Data Engineering days when all I had to worry about was reading files with Python and throwing stuff into Postgres or some other database. The good ol’ days. The other day I was reminiscing about what I worked on a lot during the beginning of my data career. Relational databases plus Python was pretty much the name of the game.

One of the struggles I always had was figuring out how fast I could load data into Postgres. psycopg2 was always my Python package of choice for working with Postgres; it's a wonderful library. Today I want to give a shout-out to my old self by performance testing Python inserts into Postgres. There are about a million ways, shapes, and sizes to get a bunch of records from some CSV file, through Python, and into Postgres.
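
To give you a taste of what I mean, here's a rough sketch of two contenders you might race against each other. The connection string, trips table, and columns are all hypothetical:

```python
import csv
import time

import psycopg2
from psycopg2.extras import execute_values

# hypothetical connection and table
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

with open("trips.csv") as f:
    rows = list(csv.reader(f))

# contender one: batched INSERTs via execute_values
start = time.time()
execute_values(cur, "INSERT INTO trips (pickup, dropoff) VALUES %s", rows)
conn.commit()
print(f"execute_values: {time.time() - start:.2f}s")

# contender two: COPY, usually the fastest way to bulk load a CSV
start = time.time()
with open("trips.csv") as f:
    cur.copy_from(f, "trips", sep=",", columns=("pickup", "dropoff"))
conn.commit()
print(f"copy_from: {time.time() - start:.2f}s")
```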

I also enjoy making people mad … there’s always that. Nothing makes people mad at you like a good ol’ performance test 🙂

Read more

What to choose, what to choose? The age-old problem that has plagued data engineers forever, ok, maybe more like 10 years: should you use CTEs or subqueries when writing your SQL code? This has become an even more relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. 15 years ago, working on SQL Server, I would ask myself the same question.

Are they really that different at all? Is it just a matter of preference? Let's take a look at a few examples of CTE vs subquery using SparkSQL and see what we see.
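
As a preview of the shape of the thing, here's the same hypothetical orders rollup written both ways through SparkSQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CTE version, against a made-up `orders` table
cte = spark.sql("""
    WITH daily AS (
        SELECT order_date, SUM(amount) AS total
        FROM orders
        GROUP BY order_date
    )
    SELECT * FROM daily WHERE total > 1000
""")

# subquery version, logically identical
sub = spark.sql("""
    SELECT * FROM (
        SELECT order_date, SUM(amount) AS total
        FROM orders
        GROUP BY order_date
    ) daily
    WHERE total > 1000
""")

# comparing the physical plans is where it gets interesting
cte.explain()
sub.explain()
```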

Read more

Seriously, just don't do it, they are bad for you. Listen to your mother, just say no. The dreaded ORMs (Object Relational Mappers) that do all the hard SQL work for you. But they come with many unintended consequences that are bad for your health and wellness in the long term. Many unsuspecting victims have been sucked into ORMs with the promise of an easier transition, giving programmers a familiar object-oriented design pattern for manipulating the data in a relational database, say Postgres or MySQL.
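
To be fair and show you what's being sold, here's a minimal sketch of that promise using SQLAlchemy as the example ORM; the connection string and users model are made up:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# a hypothetical table mapped to a Python class
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="dana"))  # the ORM writes the INSERT for you
    session.commit()
    # and the SELECT ... WHERE, all without touching SQL
    folks = session.query(User).filter(User.name == "dana").all()
```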

Again I tell you, don’t fall for the siren songs, there are tears and sorrow down the long and lonely ORM road.

Read more

Data Lake, Data Warehouse, Lake House, Data Mart, it's always something, isn't it? Don't get me started on Data Mesh. Yikes, it's hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it all really boils down to: what is the real difference? Is it data modeling, architecture, storage? I think there are a few different things that differentiate a Data Lake from a true Data Warehouse, so let's talk.

Read more

Apache Druid, kinda like that second cousin you know about … but don't really know. When you see them for the first time in 10 years, you kinda look at them out of the corner of your eye. That's how I feel about Apache Druid. I've always known it has been there, lurking around in the shadows, but it rarely pokes its head out, and I have no idea what, why, or how it is used. Time to change that, for better or worse. Let's take a 10,000-foot survey of Druid.

Read more

I've met my fair share of snooty people who poo-poo SQL and databases as second-class hand-me-downs. I still remember talking to an academic computer science grad who explained to me how he refused to teach database classes; he was just too good for that. Whatever. Apparently, understanding how 90% of companies manage to operate as data-driven businesses just isn't important to some people. There is probably nothing more important in the tool belt of a data engineer than being above average at SQL and databases: tuning queries, writing queries, indexing, designing data warehouses. I'm sure there are some Hadoop data engineers who skipped the RDBMS step entirely, but that is not the normal path of a data engineer. Let's dive into the fundamentals of SQL and databases.

Read more
Sometimes Pandas is slow like this … until you tweak it.

I never understand it when someone comes up with a great tool, then defaults it to work poorly … leaving the rest up to the imagination. The Pandas dataframe has a great and underutilized tool … to_sql(). Lesson learned: always read the fine print, I guess. I'm usually guilty of this myself … wondering why something is slow and sucks … and not taking the time to read the documentation. Here are some musings on using to_sql() in Pandas and how you should configure it so you don't pull your hair out.
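
As a taste of the kind of tweak I'm talking about, here's a sketch (made-up table, file, and connection) that flips on the knobs to_sql() ships with but doesn't turn on by default:

```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection and input file
engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
df = pd.read_csv("data.csv")

# the default is one INSERT per row; method="multi" packs many rows into
# each INSERT statement, and chunksize keeps each batch a sane size
df.to_sql(
    "my_table",
    engine,
    if_exists="append",
    index=False,
    method="multi",
    chunksize=5000,
)
```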

Read more