Why normalization exists
Normalization is a process of structuring a relational database to reduce data redundancy and prevent update anomalies — bugs where the same fact is stored in multiple places and they fall out of sync. The formal definition comes in numbered normal forms (NF). In practice, reaching 3NF is the standard goal for OLTP databases.
The unnormalized starting point
Imagine you started with a single spreadsheet for orders:
order_id | customer_name | customer_email | product_ids | product_names | total
---------|---------------|----------------------|---------------|---------------------|-------
1001 | Alice Brown | alice@example.com | 42, 55 | Keyboard, Mouse | 150.00
1002 | Alice Brown | alice@example.com | 42 | Keyboard | 75.00
1003 | Bob Smith | bob@example.com | 55 | Mouse | 75.00
Problems: Alice's email is duplicated. product_ids and product_names are
comma-lists in one cell. If Alice changes her email, we must update every row.
If a product is renamed, we must find every order containing it.
First Normal Form (1NF) — no repeating groups
1NF requires that each column holds a single atomic value — no comma-lists, arrays, or JSON blobs where each element represents a separate fact.
Fix: split product_ids / product_names into a separate order_items
table with one row per product per order.
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(254) NOT NULL,
total_amount NUMERIC(10,2) NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL REFERENCES orders(id),
product_id INT NOT NULL,
product_name VARCHAR(200) NOT NULL,
unit_price NUMERIC(10,2) NOT NULL,
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Now each cell holds one value. But customer_name and customer_email are
still duplicated across every order for the same customer.
Second Normal Form (2NF) — no partial dependencies
2NF applies to tables with composite primary keys. It requires that every non-key column depends on the whole key, not just part of it.
In order_items, product_name depends only on product_id (part of the
key), not on (order_id, product_id). That is a partial dependency — a
2NF violation.
Fix: move product_name to a products table keyed on product_id.
CREATE TABLE products (
id INT PRIMARY KEY,
product_name VARCHAR(200) NOT NULL,
unit_price NUMERIC(10,2) NOT NULL
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL REFERENCES orders(id),
product_id INT NOT NULL REFERENCES products(id),
quantity INT NOT NULL,
unit_price NUMERIC(10,2) NOT NULL, -- price at time of order (snapshot)
PRIMARY KEY (order_id, product_id)
);
Third Normal Form (3NF) — no transitive dependencies
3NF requires that non-key columns depend only on the primary key, not on other non-key columns (a transitive dependency).
In the orders table, customer_email depends on customer_name, not on
order_id. If we store customer_name → customer_email in every order,
updating the email requires finding every order for that customer.
Fix: extract customers into their own table.
CREATE TABLE customers (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
name VARCHAR(100) NOT NULL,
email VARCHAR(254) NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE orders (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
customer_id BIGINT NOT NULL REFERENCES customers(id),
total_amount NUMERIC(10,2) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Now updating a customer's email requires changing exactly one row in customers.
All their orders automatically reflect the new email via the join.
The final 3NF schema
-- customers (one row per customer)
-- products (one row per product)
-- orders (one row per order, references customers)
-- order_items (one row per product per order, references orders + products)
-- Query: full order details
SELECT
o.id AS order_id,
c.name AS customer_name,
p.product_name,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.id = 1001;
When to denormalize
Normalization is right for OLTP (transactional writes). For analytics and reporting (OLAP), denormalization trades write anomaly risk for read performance.
-- Denormalized analytics table (pre-joined, no further joins needed)
CREATE TABLE order_facts AS
SELECT
o.id,
o.created_at,
c.country_code,
p.category,
oi.quantity * oi.unit_price AS line_revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id;
This table is a snapshot — data warehouse tables, materialized views, and reporting replicas are acceptable places for denormalization. The normalized source tables remain the system of record for writes.
Recap
Normalize your OLTP schema to at least 3NF: 1NF eliminates repeating groups (no arrays in cells); 2NF eliminates partial dependencies (non-key columns must depend on the whole key); 3NF eliminates transitive dependencies (non-key columns must depend only on the primary key). Denormalization is a deliberate optimisation for read-heavy workloads — always denormalize a copy, not the source of truth.