Normalization is the process of organizing a relational schema to reduce data redundancy and prevent update anomalies. The theory was introduced by E.F. Codd and is expressed as a series of normal forms (1NF, 2NF, 3NF, BCNF …) — each one stricter than the last.
-- Unnormalized: storing multiple values in one cell (violates 1NF)
-- orders: | id | customer | items |
-- data: | 1 | Alice | 'Pen, Pencil, Pad' |
-- After normalization: separate tables, one fact per cell
-- orders: | id | customer_id |
-- order_items: | order_id | product_id | qty |
Benefits: no redundant copies to keep in sync, constraints are enforceable, queries are composable.
Rule of thumb: normalize to at least 3NF for transactional (OLTP) schemas; selectively denormalize for read-heavy reporting (OLAP) only when profiling proves it necessary.
Un-normalized schemas suffer from three anomalies that make data unreliable:
- Insertion anomaly — you cannot store a fact without also storing another unrelated fact. E.g., you cannot record a new department unless you also have an employee for it.
- Update anomaly — the same fact appears in multiple rows. Changing a manager's name requires updating every row for every employee in that department. Miss one row → inconsistent data.
- Deletion anomaly — deleting a row removes more facts than intended. Delete the last employee in a department and you lose the department's name/location too.
-- Bad: employee table stores department info in every row
-- | emp_id | emp_name | dept_id | dept_name | dept_location |
-- If you rename the dept, you must update EVERY employee row.
-- Fixed (normalized): dept facts live in one place
CREATE TABLE departments (id INT PRIMARY KEY, name TEXT, location TEXT);
CREATE TABLE employees (id INT PRIMARY KEY, name TEXT, dept_id INT REFERENCES departments(id));
Rule of thumb: if the same value must be updated in more than one row to keep the data consistent, you have a normalization problem.
A table is in 1NF when:
- Every column contains atomic (indivisible) values — no sets, lists, or repeating groups inside a single cell.
- Every column contains values of a single type.
- Each row is uniquely identifiable (there is a primary key).
-- Violates 1NF: multiple phone numbers in one column
-- | id | name | phones |
-- | 1 | Alice | '555-1234, 555-5678' |
-- 1NF compliant: separate table for multi-valued attribute
CREATE TABLE contacts (id INT PRIMARY KEY, name TEXT);
CREATE TABLE contact_phones (
contact_id INT REFERENCES contacts(id),
phone TEXT NOT NULL,
PRIMARY KEY (contact_id, phone)
);
Rule of thumb: if a cell contains comma-separated values or you find
yourself doing LIKE '%value%' to search within a column, the table
violates 1NF and needs to be split.
A functional dependency (FD) A → B means that knowing the value of
A uniquely determines the value of B. In a table, a functional
dependency is a constraint on which combinations of values are valid.
-- In an orders table:
order_id → customer_id (each order has exactly one customer)
order_id → order_date
(order_id, product_id) → quantity (composite key determines quantity)
-- Problematic FD in an unnormalized table:
dept_id → dept_name (department name depends only on dept_id,
not on the full PK of the employee row)
Understanding FDs is the foundation of 2NF and 3NF: each normal form removes a class of problematic FDs from the schema.
Rule of thumb: draw out the FDs before designing a schema. Every non-key column should depend on the whole key and nothing but the key. (This is essentially the definition of 3NF in plain English.)
A table is in 2NF when it is in 1NF and every non-key column is fully functionally dependent on the whole primary key — not just part of it. 2NF only matters when the PK is composite.
-- Violates 2NF: PK is (order_id, product_id) but product_name
-- depends only on product_id (partial dependency)
-- | order_id | product_id | product_name | quantity |
-- Fix: split product facts into their own table
CREATE TABLE products (
id INT PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE order_items (
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Rule of thumb: if any non-key column depends on part of a composite primary key, move those columns to a table where that partial key is the full primary key.
A table is in 3NF when it is in 2NF and no non-key column determines another non-key column (no transitive dependencies).
-- Violates 3NF: zip_code → city (transitive: emp_id → zip_code → city)
-- | emp_id | name | zip_code | city |
-- Fix: move the transitive dependency to its own table
CREATE TABLE zip_codes (zip TEXT PRIMARY KEY, city TEXT NOT NULL);
CREATE TABLE employees (
id INT PRIMARY KEY,
name TEXT NOT NULL,
zip_code TEXT REFERENCES zip_codes(zip)
);
The classic mnemonic: "Every non-key attribute must depend on the key, the whole key, and nothing but the key — so help me Codd."
Rule of thumb: if changing one non-key value (like a zip code) should automatically update another non-key value (the city), those values belong in a separate table joined by a foreign key.
BCNF (sometimes called 3.5NF) is a stricter version of 3NF. A table
is in BCNF if, for every non-trivial functional dependency A → B, A
is a superkey (a set of columns that uniquely identifies a row).
BCNF and 3NF differ only when a table has multiple overlapping candidate keys. 3NF allows a non-key column to determine part of another candidate key; BCNF does not.
-- Classic BCNF violation: Tutors(student, subject, tutor)
-- Candidate keys: (student, subject) and (student, tutor)
-- FD: tutor → subject (a tutor teaches exactly one subject)
-- This violates BCNF because 'tutor' is not a superkey.
-- Fix (decompose):
-- TutorSubject(tutor PK, subject)
-- StudentTutor(student, tutor, FK tutor → TutorSubject)
Rule of thumb: BCNF matters in schemas with multiple candidate keys. In practice, 3NF is the target for most applications; BCNF is pursued when redundancy in multi-key tables causes real anomalies.
Denormalization intentionally introduces redundancy into a schema to improve read performance — typically by precomputing joins or aggregations and caching their results in additional columns or tables.
-- Normalized: count must be computed with a JOIN every time
SELECT u.id, COUNT(o.id) AS order_count
FROM users u LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id;
-- Denormalized: cache the count in the users table
ALTER TABLE users ADD COLUMN order_count INT NOT NULL DEFAULT 0;
-- Maintain via trigger or application logic on each insert/delete
UPDATE users SET order_count = order_count + 1 WHERE id = NEW.user_id;
Trade-offs:
- ✅ Faster reads, simpler queries, reduced join cost.
- ❌ Write complexity — must keep redundant copies in sync.
- ❌ Risk of inconsistency if update logic is missed.
Rule of thumb: normalize first; denormalize only after profiling shows that a specific query is a bottleneck and the added write complexity is worth the read gain. Document every denormalized column with a comment explaining what it caches and how it is maintained.
A star schema is a denormalized dimensional model used in data warehouses (OLAP). It centers on a large fact table (events/transactions) surrounded by smaller dimension tables (descriptive attributes). Dimensions are intentionally denormalized for fast, simple queries.
-- Fact table: one row per sale event
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),
quantity INT NOT NULL,
revenue NUMERIC(12,2) NOT NULL
);
-- Dimension: denormalized (city + country in same row, no 3NF)
CREATE TABLE dim_store (
store_key INT PRIMARY KEY,
name TEXT,
city TEXT,
country TEXT
);
OLTP schemas normalize to avoid write anomalies. OLAP star schemas denormalize to minimize joins and maximize scan throughput for analytics.
Rule of thumb: use a normalized 3NF schema for transactional applications; use a star or snowflake schema for analytical/BI workloads. Don't mix them — ETL pipelines transform data between the two.
Fully normalized schemas can require many joins, which hurts read performance on large datasets. Common pragmatic trade-offs:
- Add indexes before denormalizing — a join on indexed FKs is fast. Denormalization should only be considered after indexes fail to help.
- Materialized views / summary tables — precompute expensive aggregates without changing the base schema.
- Selective redundancy — add a
cached_countcolumn or a denormalizedstatusflag where read frequency vastly exceeds write frequency. - Separate OLAP schema — replicate data nightly into a star schema for reporting; keep OLTP tables normalized.
-- Before denormalizing: try an index on the join column
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
-- EXPLAIN ANALYZE to verify it is used before adding redundant columns
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id)
FROM users u JOIN orders o ON o.user_id = u.id
GROUP BY u.id;
Rule of thumb: profile with real data before denormalizing. A missing index is the most common "normalization performance problem" — and it is a 30-second fix compared to the ongoing cost of maintaining denormalized data.
- Natural key: a column (or set of columns) from the real-world domain that uniquely identifies an entity — e.g., email address, ISBN, social security number. Natural keys carry meaning but can change over time.
- Surrogate key: a system-generated identifier with no business meaning —
e.g., an auto-increment
idor UUID. It never changes and has no domain semantics.
-- Natural key PK (email can change → cascading updates on all FKs)
CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT);
-- Surrogate PK (id never changes; email change is isolated to one row)
CREATE TABLE users (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
name TEXT NOT NULL
);
Rule of thumb: use a surrogate PK for every table; declare natural keys
as UNIQUE constraints alongside the surrogate. This gives you the integrity
guarantee of natural keys without the cascade pain of changing a PK.
4NF eliminates multi-valued dependencies (MVDs) — independent many-to-many relationships stored in a single table, which causes multiplicative row explosion.
-- Violates 4NF: Employee can have many Skills AND many Projects, independently.
-- | emp_id | skill | project |
-- | 1 | SQL | Alpha |
-- | 1 | SQL | Beta | ← duplicating the SQL row for Beta
-- | 1 | Python | Alpha | ← duplicating Alpha row for Python
-- | 1 | Python | Beta |
-- 4NF: split into two separate tables
-- EmployeeSkills(emp_id, skill)
-- EmployeeProjects(emp_id, project)
Adding a new skill in the original table requires adding rows for every project combination, creating redundancy and anomalies.
Rule of thumb: when two many-to-many relationships are independent of each other but share the same entity, split them into two separate join tables rather than combining them into one.
Run through this checklist:
- 1NF check: every cell contains one atomic value; there is a primary key.
- 2NF check: if the PK is composite, every non-key column depends on the entire PK, not just part of it.
- 3NF check: no non-key column determines another non-key column
(no transitive dependencies — e.g.,
zip → citywhenzipis not the PK).
-- Red flags that indicate a 3NF violation:
-- 1. A column stores concatenated values ('red,blue,green')
-- 2. Repeated column groups (phone1, phone2, phone3)
-- 3. A non-PK column appears in WHERE of a JOIN as if it were a PK
-- 4. Updating one row's value requires updating dozens of other rows
-- 5. You cannot add a fact without inserting an unrelated row
Rule of thumb: if you can answer "what does each column tell you about?" and the answer is always "it tells you something about the primary key (and only the primary key)", the table is in 3NF.
A junction table (also called a bridge or associative table) resolves a many-to-many relationship between two entities into two one-to-many relationships. It stores the association as rows rather than as repeated columns.
-- Many students can enroll in many courses
CREATE TABLE students (id INT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE courses (id INT PRIMARY KEY, title TEXT NOT NULL);
-- Junction table: one row per (student, course) pair
CREATE TABLE enrollments (
student_id INT NOT NULL REFERENCES students(id) ON DELETE CASCADE,
course_id INT NOT NULL REFERENCES courses(id) ON DELETE CASCADE,
enrolled_at DATE NOT NULL DEFAULT CURRENT_DATE,
PRIMARY KEY (student_id, course_id)
);
The junction table can carry payload columns (enrolled_at, grade) that describe the relationship itself — something impossible to store on either side alone.
Rule of thumb: whenever two entities have a many-to-many relationship, always model it with a junction table. Never store comma-separated IDs in a single column as an alternative.
Normalization is not always the right answer. Practical reasons to stop before reaching 3NF or BCNF:
- Read performance is the primary concern — heavily queried analytical tables benefit from fewer joins, even at the cost of redundancy.
- The relationship is stable — if a denormalized value (e.g., a country name embedded in every row) will almost never change, the update anomaly risk is negligible.
- External schema constraints — integrating with a vendor schema or legacy system you cannot change.
- Simplicity for small, short-lived data — a temporary staging table or a one-off report table does not need 3NF rigor.
-- Acceptable denormalization: reporting snapshot
-- Copies customer_name at the time of the order; intentionally redundant
-- so historical reports are stable even if the customer renames.
CREATE TABLE order_snapshots (
order_id BIGINT PRIMARY KEY,
customer_id INT NOT NULL,
customer_name TEXT NOT NULL, -- denormalized snapshot
total NUMERIC(12,2) NOT NULL,
snapped_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Rule of thumb: normalize transactional data to 3NF by default. Deviate deliberately, document the reason, and ensure the update path (trigger, ETL, application code) is clearly owned and tested.
More Schema & Data Types interview questions
More ways to practice
The self-quiz is live. Get notified when mock interviews and new question packs drop.