Three ways to combine data, one right answer per situation
Every intermediate SQL writer reaches a point where they can express the same query as a
JOIN, a subquery, or a CTE — and aren't sure which to pick. The choice matters: a
correlated subquery can turn a millisecond query into a minutes-long scan; a CTE with the
wrong database can silently perform worse than a join. This guide explains how the query
planner treats each form, when performance diverges, and the judgment that makes the
right choice obvious.
Quick-reference comparison
JOIN | Subquery | CTE (WITH) | |
|---|---|---|---|
| What it does | Combines rows from two tables | Returns a result set used by the outer query | Names a temporary result set, referenced above the main query |
| Execution | Planner optimises join order, indexes | Inline subqueries often merged with outer; correlated ones run per row | Usually materialised once, or inlined by planner (database-dependent) |
| Performance | Generally fastest — planners are tuned for joins | Inline: similar to join. Correlated: O(n) per outer row | Depends on dialect — Postgres often materialises, MySQL/SQL Server may inline |
| Readability | Concise for simple cases; nests poorly | Can be hard to read inside WHERE or FROM | Best readability for multi-step logic |
| Reusable? | No | No | Yes — reference the same CTE multiple times |
| Recursive? | No | No | Yes (WITH RECURSIVE) |
| Best for | Combining tables, filtering by relation | Scalar lookups, EXISTS checks | Multi-step transforms, recursive queries, readability |
JOINs — the planner's home territory
A JOIN is the primary way relational databases combine data. The query planner has
decades of optimisation for join operations: it picks hash joins, merge joins, or nested
loop joins based on table size and index availability, and it can reorder joins freely to
minimise intermediate row counts.
-- Simple inner join: orders with their customer names
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.total > 100;
-- Planner can use index on customer_id and o.total simultaneously
Joins shine for combining tables on a key and filtering by relationship. They're the right tool when both sides of the relationship are needed in the result.
-- LEFT JOIN: all customers, even those with no orders
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
Rule of thumb: default to a JOIN for combining tables. If you find yourself writing
a subquery where a join would work just as well, the join is almost always faster and
clearer.
Deep dive: Joins interview questions
Subqueries — scalar lookups and existence checks
A subquery is a SELECT nested inside another statement. There are two fundamentally
different kinds, and their performance characteristics are opposite.
Non-correlated subqueries (fast)
A non-correlated subquery runs once and hands its result to the outer query. The planner usually optimises it identically to a join.
-- Non-correlated: inner query runs ONCE, result reused
SELECT name
FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- ↑ computed once, compared against every row
Correlated subqueries (can be slow)
A correlated subquery references a column from the outer query, forcing it to re-execute for every row the outer query processes. On a million-row table, that's a million subquery executions.
-- Correlated: re-runs the subquery for EVERY employee row — O(n²) risk
SELECT name, salary
FROM employees e
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = e.department -- ← references outer row
);
The fix is almost always to push the subquery into a JOIN or CTE so it runs once:
-- Better: aggregate once, then join
WITH dept_avg AS (
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
)
SELECT e.name, e.salary
FROM employees e
JOIN dept_avg d ON d.department = e.department
WHERE e.salary > d.avg_sal;
EXISTS vs IN for membership checks
For checking whether a related row exists, EXISTS is usually faster than IN on large
sets — it short-circuits as soon as it finds one match, while IN builds the full set.
-- EXISTS: stops scanning as soon as one match is found
SELECT name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
-- IN: materialises the full set of customer_ids first
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders);
Rule of thumb: use a subquery for scalar lookups (single value comparisons) and
EXISTS checks. Never use a correlated subquery where a join or CTE with a GROUP BY
would work — you'll turn a fast query into a table scan.
Deep dive: Subqueries interview questions
CTEs (WITH) — readability and multi-step logic
A Common Table Expression names a temporary result set at the top of a query. The main query can reference it like a table, and the same CTE can be referenced multiple times. CTEs don't change what is possible — anything a CTE expresses could also be a subquery — but they dramatically improve readability for multi-step logic.
-- Without CTE: deeply nested, hard to read
SELECT * FROM (
SELECT department, AVG(salary) AS avg_sal
FROM (
SELECT * FROM employees WHERE active = true
) active_emps
GROUP BY department
) dept_avg
WHERE avg_sal > 80000;
-- With CTE: reads top to bottom like a story
WITH active_employees AS (
SELECT * FROM employees WHERE active = true
),
dept_avg AS (
SELECT department, AVG(salary) AS avg_sal
FROM active_employees
GROUP BY department
)
SELECT * FROM dept_avg WHERE avg_sal > 80000;
CTEs are also the only way to write recursive queries — traversing trees, graphs, or hierarchies — without procedural code:
-- Recursive CTE: walk an org chart
WITH RECURSIVE org AS (
SELECT id, name, manager_id, 0 AS depth
FROM employees WHERE manager_id IS NULL -- root
UNION ALL
SELECT e.id, e.name, e.manager_id, o.depth + 1
FROM employees e
JOIN org o ON o.id = e.manager_id -- recurse
)
SELECT * FROM org ORDER BY depth;
CTE materialisation — the performance caveat
Whether a CTE is a performance help or hindrance depends on the database:
- PostgreSQL (pre-14): CTEs are always materialised (computed once, result stored). This can be faster (when referenced many times) or slower (when the optimiser could push a filter inside).
- PostgreSQL 14+ / SQL Server / MySQL 8+: the planner may inline a CTE, treating it like a subquery and optimising freely.
- Oracle: similar to Postgres 14+ — may inline or materialise based on cost.
When you need Postgres to not materialise a CTE, use WITH ... AS NOT MATERIALIZED (14+).
Rule of thumb: use a CTE whenever a query has more than two logical steps or when you need to reference the same intermediate result more than once. Prefer a join over a CTE when performance is critical and the planner version is uncertain.
Deep dive: CTEs interview questions
The decision flowchart
Do I need to combine columns from two tables?
└── Yes → JOIN (default choice; planner is optimised for it)
Do I need a single value from another table?
└── Yes → scalar subquery (SELECT MAX(...) etc.)
Do I just need to check whether a related row exists?
└── Yes → EXISTS subquery (short-circuits, usually faster than IN)
Does my logic have multiple steps, or do I reference the same
intermediate result more than once?
└── Yes → CTE (readability + reuse; watch materialisation in old Postgres)
Does the query need to walk a tree or graph?
└── Yes → recursive CTE (no other clean SQL option)
Interview tip — explain the trade-offs
Interviewers asking "use a join or a subquery?" are testing whether you understand the performance implications. The safe answer:
"I'd default to a join because planners have the most optimisation for that form. I'd use a non-correlated subquery for scalar comparisons,
EXISTSfor membership checks, and a CTE to break complex multi-step logic into readable pieces. I'd avoid correlated subqueries except in rare cases where the database rewrites them to a join anyway."
Recap
JOINs are the planner's home territory — use them to combine tables and let the
optimiser choose the best execution strategy. Non-correlated subqueries run once and
are often rewritten to joins internally; correlated subqueries re-run per outer row
and can silently cause O(n²) performance — replace them with a JOIN + GROUP BY or a
CTE. CTEs trade raw performance for readability and reusability, and are the only way
to write recursive queries; their materialisation behaviour varies by database version.
Default to joins, escalate to CTEs for clarity, and avoid correlated subqueries on large
tables.