Why do database guys go on about normalisation?
What is it? How does it help?
Does it apply to anything outside of databases?
Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Normalization is a technique for organizing data in a database. It is important that a database is normalized to minimize redundancy (duplicate data) and to ensure only related data is stored in each table. It also prevents any issues stemming from database modifications such as insertions, deletions, and updates.
Database normalization is the process of organizing data into tables in such a way that the results of using the database are always unambiguous and as intended. Such normalization is intrinsic to relational database theory.
Normalization is a Six stage process - After the first stage, the data is said to be in first normal form, after the second, it is in second normal form, after the third, it is in third normal form and so on.
Normalization is basically to design a database schema such that duplicate and redundant data is avoided. If the same information is repeated in multiple places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.
There is a number of normalization levels from 1. normal form through 5. normal form. Each normal form describes how to get rid of some specific problem.
First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns which allows tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default. So we can ignore 1NF for the rest of the discussions.
The normal forms 2NF to 5NF all concerns scenarios where the same information is represented multiple times in the same table.
For example consider a database of moons and planets:
Moon(PK) | Planet | Planet kind ------------------------------ Phobos | Mars | Rock Daimos | Mars | Rock Io | Jupiter | Gas Europa | Jupiter | Gas Ganymede | Jupiter | Gas
The redundancy is obvious: The fact that Jupiter is a gas planet is repeated three times, one for each moon. This is a waste of space, but much more seriously this schema makes inconsistent information possible:
Moon(PK) | Planet | Planet kind ------------------------------ Phobos | Mars | Rock Deimos | Mars | Rock Io | Jupiter | Gas Europa | Jupiter | Rock <-- Oh no! Ganymede | Jupiter | Gas
A query can now give inconsistent results which can have disastrous consequences.
(Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)
The normalized design would split the table into two tables:
Moon(PK) | Planet(FK) Planet(PK) | Planet kind --------------------- ------------------------ Phobos | Mars Mars | Rock Deimos | Mars Jupiter | Gas Io | Jupiter Europa | Jupiter Ganymede | Jupiter
Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)
Rule of thumb If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example the first table has 12 individual values, while the two tables only have 9 individual (non-FK) values. This means we eliminate 3 redundant values.
We know the same information is still there, since we can write a join
query which return the same data as the original un-normalized table.
How do I avoid such problems? Normalization problems are easily avoided by giving a bit of though to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship which means they should be represented in two different tables with a foreign key-association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.
Is normalization it important? Yes, it is very important. By having a database with normalization errors, you open the risk of getting invalid or corrupt data into the database. Since data "lives forever" it is very hard to get rid of corrupt data when first it has entered the database.
But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies - whether it is 3NF or 5NF which is violated is less important as long as the problem is fixed.
(There are also some additional normal forms like DKNF and 6NF which are only relevant for special purpose systems like data-warehouses.)
Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. It makes it sound like normalization is a complicated mathematical process. However, normalization is basically just the common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.
There are a number of misconceptions around normalization:
some believe that normalized databases are slower, and the denormalization improves performance. This is only true in very special cases however. Typically a normalized database is also the fastest.
sometimes normalization is described as a gradual design process and you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problem solved by normal forms above 3rd NF are pretty rare problems in the first place, so chances are that your schema is already in 5NF.
Does it apply to anything outside of databases? Not directly, no. The principles of normalization is quite specific for relational databases. However the general underlying theme - that you shouldn't have duplicate data if the different instances can get out of sync - can be applied broadly. This is basically the DRY principle.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With