CTE vs SubQuery
What to choose what to choose? The age-old problem that has plagued data engineers forever, ok maybe like 10 years, should you use CTE’s or Sub-Queries when writing your SQL code. This has become even more of a relevant topic with the rise of SparkSQL, Snowflake, Redshift, and BigQuery. Funny how some things never change. 15 years ago working on SQL Server I would ask myself the same question.
Are they really that different at all? Is it just a matter of preference? Let’s take a look at a few examples of CTE vs Subquery using SparkSQL as an example and see what we see.
The CTE
I still remember many moons ago as a young Data Analyst when I first discovered this wonderous concept of something called a CTE or Common Table Expression in my SQL travails and trials. CTE’s were made popular because they allowed for something fancy to be done in SQL, namely Recursion. By a CTE referencing itself, it could traverse a tree of unknown depth. But honestly, you are lucky if you get to work on one or two problems that need recursion in SQL.
So what are CTEs all about?
CTEs …
- defined using the
WITH x AS
( … ) syntax - are a temporary result set
- can be thought of as a
temporary table
- can be referenced multiple times in a query
- encapsulate logic and code.
WITH my_cte AS
(
SELECT a, b, c
FROM the_alphabet
)
SELECT *
FROM another_table as T
INNER JOIN my_cte as C ON T.a = c.a
;
At a basic level, there isn’t really anything that special about a CTE, other than the possibility of recursion, which only applies to %.01 of the persons reading this. So what can you do with a CTE? I don’t know. Anything you can do with a table.
Use them instead of a sub-query, join them, combine them, reference each other, I don’t know, do whatever you want to do.
The SubQuery
Also known as the infamous nested query, and usually gets a bad rap. Why? Overused just like anything else.
There are two types of sub-queries.
- correlated
- un-correlated
What does that mean? It means that a subquery might or might not have some specific relation to the query “above” it, or “over” it.
Example …
// un-correlated
SELECT p.first_name, p.last_name,
FROM peps as p
INNER JOIN (
SELECT department_id, SUM(amt) as dep_tot
FROM deps
GROUP BY department_id
) as d ON d.department_id = p.department_id
// correlated
SELECT p.first_name, p.last_name,
FROM peps as p
WHERE department_id IN (
SELECT d.department_id
FROM deps d
WHERE d.department_id = p.department_id
)
See the difference? Does the subquery reference something value from outside itself? This is in essence correlated or non-correlated subquery.
CTE vs SubQuery
So what really is the difference between the CTE and the SubQuery? Nothing much really. Preference and readability.
The problem with SubQueries is that when you are three levels deep it gets a little old and a little hard to unwind. I mean sure, complicated queries look cool … but three months later when there is a bug it isn’t cool.
Here is really the crux of the issue, when laying out complicated queries, CTEs can make the code flow and read much easier.
Instead of this …
SELECT p.first_name, p.last_name, d.department_count, s.total_sales
FROM persons as p
INNER JOIN (
SELECT department_id, COUNT(people) as department_count
FROM department as d
WHERE department_type = 'sales'
GROUP BY department_id
) as d ON d.department_id = p.department_id
LEFT OUTER JOIN (
SELECT person_id, SUM(sales) as total_sales
FROM orders
WHERE orders.department_id = d.department_id
GROUP BY person_id
) as s ON s.person_id = p.person_id
We could do this …
WITH deps as (
SELECT department_id, COUNT(people) as department_count
FROM department as d
WHERE department_type = 'sales'
GROUP BY department_id
),
sales as (
SELECT person_id, SUM(sales) as total_sales
FROM orders
WHERE orders.department_id = d.department_id
GROUP BY person_id
)
SELECT p.first_name, p.last_name, d.department_count, s.total_sales
FROM persons as p
INNER JOIN deps as d ON d.department_id = p.department_id
LEFT OUTER JOIN sales as s ON s.person_id = p.person_id
The difference probably isn’t as noticeable in this contrived example, but I assure you once you’ve got 3 or 4 levels of subqueries, the readability, maintainability, and extensibility of SQL goes through the roof.
It’s really about encapsulating logic and improving readabiliity and maintainability.
The hard part about SQL is that most people don’t consider it a “programming language,” y’all can argue about that one later. The unfortunate downside of SQL not being considered as a programming language is that it misses out on some of the best software engineering principles. It’s probably why the old Database Developers got such a bad rap back in the day. Piles of queries that look ugly with no testing.
Newer technology like SparkSQL and dbt are changing that. But, using CTEs is probably one of the best ways to apply the idea of encapsulating logic to the world of SQL.
Just like long, jumbled, drawn-out code written in Python with functions that are 100 lines long … SQL queries that might look impressive but take scrolling and scrolling and scrolling with sub-queries here and there and everywhere become ripe for bugs and headaches.
CTEs can help with this mess. You can name them what they are, logically group them together, encapsulate specific code in them … what more could you ask for?
Should I worry about performance of a CTE vs a SubQuery?
I wouldn’t. It probably depends greatly on the underlying database to technology, the indexes, the query, and a whole host of other things. It’s probably different in different cases. Unless someone can prove to you that the speed is 50% faster, which I doubt will be the case, you should go for the readability and maintainability of the CTE every day.
There is more than one way to measure the “cost” of something. Illegible code … SQL or otherwise is probably the number one most costly problem. Far and away above that of a “slow” query.