engineering
Third Normal Form (3NF) vs Star Schema for Data Warehouses
Tim Armstrong •
Third Normal Form (3NF) #
As the name suggests, the Third Normal Form aims to reduce data duplication (through normalization); this generally means having tables that detail all the features of the object that the table is modeling. This similarity to the data models used by programmers makes using an ORM (Object reference Model) the logical choice as it reduces the cognitive load on the developers, leading to fewer coding mistakes.
This format generally makes bulk queries much slower but makes writing and updating individual rows much faster. Which aligns with the goals and needs of developers.
For example, if you have a table that records details of Pets cared for by a Veterinarian, it probably contains columns for Name, Owner, Date of birth, Species, Colour, Sex, Weight, Height, Length, and Width.
Star Schema #
Star Schemas (and, by extension, Snowflake Schemas), however, differentiate data into two categories: Facts and Dimensions. It prioritizes query speed and simplicity over data storage space and writes times.
Dimensions are data elements common to many items (e.g., place names, various time keys, colors, etc.). These tables are primarily used as look-up tables for quickly querying things by common elements.
Facts are elements that represent the primary/unique data (e.g., a transaction, an invoice, a contract, etc.). A fact table will commonly have a lot of links to dimension tables.
This schema is really great for quick and straightforward lookups. This is because, in most SQL engines, a
JOIN
is considerably faster than a lookup – i.e., scanning a table for fields that match a range is much slower than looking up all of the rows that have a matching foreign key.For example:
Third Normal Form:
SELECT invoice_number, customer_id FROM invoices WHERE timestamp >= '2024-04-01T00:00:00' AND timestamp < '2024-07-01T00:00:00'
vs Star Schema:
SELECT f.invoice_value, f.customer_id FROM f_invoices AS f JOIN d_datetimekeys AS dt WHERE dt.year=2024 AND dt.quarter=2
Even in this straightforward example, you can see that, in the 3NF version, not only does the SQL engine have more work to do comparing timestamps for each row, but as data analysts, we have to calculate what those timestamps should be every time we need to make a query, which can lead to unnecessary mistakes.
In the Star Schema version, this is precalculated and stored in an easy-to-use table (reducing the risk of mistakes), and because the SQL engine is now just looking for rows with a matching foreign key, we get our result much faster due to the indexing.
Best of both worlds? #
Unfortunately, there isn’t a real “Best of both worlds” solution, and given the fundamentally different needs of Product and Data Teams, there probably won’t ever be.
So, to get around this, most Data Engineers use tools like DBT to periodically transform a snapshot of the data from the Product team’s 3NF Schema into a Star Schema for the Data team to use.