
What is Normalisation (or Normalization)? [closed]

Why do database guys go on about normalisation?

What is it? How does it help?

Does it apply to anything outside of databases?

AJ. asked Oct 29 '08 13:10

1 Answer

Normalization basically means designing a database schema so that duplicate and redundant data are avoided. If the same information is repeated in multiple places in the database, there is a risk that it is updated in one place but not the others, leading to data corruption.

There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of a specific kind of problem.

First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns which allow tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default. So we can ignore 1NF for the rest of this discussion.
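
To make this concrete, here is a minimal SQL sketch (table and column names are invented for illustration): the first design crams a whole list of moons into a single column value, which is the kind of "table inside a cell" that 1NF forbids, while the second keeps one atomic value per column.

-- Not in 1NF: a whole list of moons packed into one column value
-- (effectively a table inside a cell). Names are invented for illustration.
CREATE TABLE planet_not_1nf (
    planet VARCHAR(20) PRIMARY KEY,
    moons  VARCHAR(200)              -- e.g. 'Io,Europa,Ganymede'
);

-- In 1NF: one atomic value per column, one row per moon.
CREATE TABLE moon_1nf (
    moon   VARCHAR(20) PRIMARY KEY,
    planet VARCHAR(20) NOT NULL
);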

The normal forms 2NF to 5NF all concern scenarios where the same information is represented multiple times in the same table.

For example consider a database of moons and planets:

Moon(PK) | Planet  | Planet kind
--------------------------------
Phobos   | Mars    | Rock
Deimos   | Mars    | Rock
Io       | Jupiter | Gas
Europa   | Jupiter | Gas
Ganymede | Jupiter | Gas

The redundancy is obvious: The fact that Jupiter is a gas planet is repeated three times, once for each moon. This is a waste of space, but much more seriously this schema makes inconsistent information possible:

Moon(PK) | Planet  | Planet kind
--------------------------------
Phobos   | Mars    | Rock
Deimos   | Mars    | Rock
Io       | Jupiter | Gas
Europa   | Jupiter | Rock <-- Oh no!
Ganymede | Jupiter | Gas

A query can now give inconsistent results which can have disastrous consequences.

(Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)
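
As a sketch of how the inconsistency surfaces, suppose the un-normalized design is a single table like the one above (all table and column names here are illustrative). A straightforward query then gives two contradictory answers to the same question:

CREATE TABLE moon_flat (
    moon        VARCHAR(20) PRIMARY KEY,
    planet      VARCHAR(20) NOT NULL,
    planet_kind VARCHAR(20) NOT NULL
);

INSERT INTO moon_flat VALUES
    ('Phobos',   'Mars',    'Rock'),
    ('Deimos',   'Mars',    'Rock'),
    ('Io',       'Jupiter', 'Gas'),
    ('Europa',   'Jupiter', 'Rock'),   -- Oh no! Contradicts the other Jupiter rows.
    ('Ganymede', 'Jupiter', 'Gas');

-- "What kind of planet is Jupiter?" now has two answers:
SELECT DISTINCT planet_kind
FROM moon_flat
WHERE planet = 'Jupiter';
-- Returns both 'Gas' and 'Rock', depending on which row you happen to trust.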

The normalized design would split the table into two tables:

Moon(PK) | Planet(FK)     Planet(PK) | Planet kind
---------------------     ------------------------
Phobos   | Mars           Mars       | Rock
Deimos   | Mars           Jupiter    | Gas
Io       | Jupiter
Europa   | Jupiter
Ganymede | Jupiter

Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)
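
In SQL the normalized design might look something like this (a sketch; names and types are illustrative). The foreign key ensures a moon can only reference a planet that actually exists, and each planet's kind is stored exactly once:

CREATE TABLE planet (
    planet      VARCHAR(20) PRIMARY KEY,
    planet_kind VARCHAR(20) NOT NULL
);

CREATE TABLE moon (
    moon   VARCHAR(20) PRIMARY KEY,
    planet VARCHAR(20) NOT NULL,
    FOREIGN KEY (planet) REFERENCES planet (planet)
);

INSERT INTO planet VALUES ('Mars', 'Rock'), ('Jupiter', 'Gas');
INSERT INTO moon VALUES
    ('Phobos', 'Mars'), ('Deimos', 'Mars'),
    ('Io', 'Jupiter'), ('Europa', 'Jupiter'), ('Ganymede', 'Jupiter');
-- 'Jupiter is a gas planet' is now stated in exactly one row,
-- so it cannot contradict itself.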

Rule of thumb: If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example, the first table has 15 individual values, while the two tables have only 9 individual (non-FK) values, so the same information is now carried by 6 fewer cell values.

We know the same information is still there, since we can write a join query which returns the same data as the original un-normalized table.
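
For example, assuming the sketched tables above, a join along these lines reproduces the original un-normalized view:

SELECT m.moon, m.planet, p.planet_kind
FROM moon AS m
JOIN planet AS p ON p.planet = m.planet;
-- Same rows as the original single table, but each planet's kind
-- now comes from one authoritative place.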

How do I avoid such problems? Normalization problems are easily avoided by giving a bit of thought to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship, which means they should be represented in two different tables with a foreign-key association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.

Is normalization important? Yes, it is very important. A database with normalization errors opens up the risk of getting invalid or corrupt data into the database. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.

But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies - whether it is 3NF or 5NF that is violated is less important, as long as the problem is fixed.

(There are also some additional normal forms like DKNF and 6NF which are only relevant for special purpose systems like data-warehouses.)

Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse, and they make it sound like normalization is a complicated mathematical process. However, normalization is basically just common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.

There are a number of misconceptions around normalization:

  • some believe that normalized databases are slower and that denormalization improves performance. However, this is only true in very special cases. Typically a normalized database is also the fastest.

  • sometimes normalization is described as a gradual design process where you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problems solved by normal forms above 3NF are pretty rare in the first place, so chances are that your schema is already in 5NF.

Does it apply to anything outside of databases? Not directly, no. The principles of normalization are quite specific to relational databases. However, the general underlying theme - that you shouldn't keep duplicate data that can get out of sync - can be applied broadly. This is basically the DRY principle.

JacquesB answered Sep 28 '22 05:09