 

When is it ok to NOT normalize? [closed]

Consider the following relation to illustrate my question:

Person( name, street, city, zipcode )

name -> street, city, zipcode
street + city -> zipcode

So if we know the name, we also know where the person lives. But zipcode is also (transitively) dependent on street + city. Thus, this relation breaks 3NF and should be split into two tables to conform.
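For illustration, here is a minimal SQL sketch of that decomposition. The table and column definitions (the types, the person_flat / address_zip names, name as the primary key) are just my own assumptions to make the example concrete:

    -- Single-table design: zipcode is transitively dependent on (street, city)
    CREATE TABLE person_flat (
        name    VARCHAR(100) PRIMARY KEY,
        street  VARCHAR(100) NOT NULL,
        city    VARCHAR(100) NOT NULL,
        zipcode VARCHAR(10)  NOT NULL
    );

    -- 3NF decomposition: zipcode is stored once per (street, city)
    CREATE TABLE address_zip (
        street  VARCHAR(100),
        city    VARCHAR(100),
        zipcode VARCHAR(10) NOT NULL,
        PRIMARY KEY (street, city)
    );

    CREATE TABLE person (
        name   VARCHAR(100) PRIMARY KEY,
        street VARCHAR(100) NOT NULL,
        city   VARCHAR(100) NOT NULL,
        FOREIGN KEY (street, city) REFERENCES address_zip (street, city)
    );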

But in this case, we are not interested in zipcodes as a separate entity. The zipcode is part of the address and just happens to be transitively dependent. We will never use it separately.

I understand why normalization is a good thing. But is it really necessary to always normalize (and thus make the database more complex)? If not, how do you know when you can skip it?

(if my terminology or notation is wrong, you are welcome to correct me)

asked Sep 23 '12 by rlv-dan


3 Answers

In addition to performance, one more reason for not fully normalizing might be if you have a certain "fuzziness" in your data.

As far as I understand[1], a ZIP may be specific to a city block or area, which means an especially long street could have more than one ZIP. And even if ZIP did correspond to city+street in the US, that might not be true for postal codes in other countries, should you ever decide to go international.

But even assuming ZIPs really are city+street specific, human beings likely enter the address information themselves, which means they can make mistakes, including an incorrect ZIP. So you could end up with two ZIPs for the same combination of city and street.

A fully normalized database simply has no way of representing that - you'd have to pick one of the ZIPs somehow. Unless you have access to a full, up-to-date database of all ZIPs, you have no good way of resolving this conflict. If you end up picking the wrong ZIP, every person on that city+street will have the wrong ZIP.

On the other hand, a de-normalized database lets each person keep their own ZIP and suffer any consequences in isolation from other persons. You could even implement an auto-complete suggestion and an "are you sure?" warning when the user enters a different ZIP for a city+street that already has one, but then let them proceed if they confirm they are sure.
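As a rough sketch of the lookup that could back such a suggestion (reusing the illustrative person_flat table sketched under the question; the placeholders are just examples):

    -- ZIP(s) already on file for the street+city the user just typed;
    -- more than one row back means the stored data already disagrees.
    SELECT zipcode, COUNT(*) AS occurrences
    FROM person_flat
    WHERE street = ? AND city = ?
    GROUP BY zipcode
    ORDER BY occurrences DESC;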


[1] And I don't live in the US, so I might be off.

answered Oct 22 '22 by Branko Dimitrijevic


Normalization is a tool for analysing dependencies and ensuring the correct implementation of data integrity rules (business rules) represented as dependencies. An underlying assumption of normalization is that you know or can determine which business rules you actually want to implement. If you are already certain you don't want or need to enforce a given business rule then there is probably little value in considering it as a dependency when designing the database for it. Remember that the point of dependencies is that a rule is in force at all times for all possible data in the database; not just for current data or some particular subset of data.

It may be the case that the dependency {street, city} -> {zipcode} is not truly a desired business rule for the system and therefore should not be enforced. For example, if data has to be entered without address-verification software, it may be impractical to keep zipcodes consistent in that way. That doesn't mean you are violating any normalization rule; it just means the functional dependency isn't intended to hold, does not in fact hold, and therefore isn't a transitive dependency in any real sense.
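To make that concrete: with the single-table design nothing declares the dependency, so rows like the following can coexist perfectly legally, which is what "the dependency does not hold" amounts to (again using the illustrative person_flat sketch from the question; names and values are made up):

    -- Both inserts succeed: the schema makes no claim that
    -- (street, city) determines zipcode.
    INSERT INTO person_flat (name, street, city, zipcode)
    VALUES ('Alice', '1 Main St', 'Springfield', '11111');

    INSERT INTO person_flat (name, street, city, zipcode)
    VALUES ('Bob',   '1 Main St', 'Springfield', '22222');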

answered Oct 22 '22 by nvogel


The value and the cost of pushing normalization all the way depend mostly on what you will be doing with the data.

There are (at least) two radically different ways of using the data. One is On-Line Transaction Processing (OLTP). The other is On-Line Analytical Processing (OLAP).

In OLTP, the cost of not normalizing can be quite high. Transactions get to be more complex and slower, and bottlenecks degrade performance. In OLAP, the benefits of normalizing are limited, and there are other design disciplines that can yield more benefits for the same effort. One of those alternatives is star schema design, which you can look up.

But it isn't so much a matter of NOT normalizing, or of DEnormalizing, but of following a different design discipline, even if it doesn't result in a normalized database.

Getting back to the specific case you outlined, there are lots of systems with a heavy transaction load on customer activity where the customer table is used for read-only purposes in those transactions.

Failure to conform to 3NF only hurts you when you enter a new customer and have to key in the zip code all over again even though other customers already share the same city, street, and zip code. And in the event that the post office changes the zip code assigned to a given street, you'll have to update lots of addresses instead of just one row in a normalized table.
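Roughly what that maintenance difference looks like, assuming the illustrative person_flat / address_zip tables sketched under the question (street and zip values are made up):

    -- Denormalized design: one statement, but it touches every person on the street.
    UPDATE person_flat
    SET zipcode = '54321'
    WHERE street = 'Main St' AND city = 'Springfield';

    -- Normalized design: the zipcode lives in exactly one row.
    UPDATE address_zip
    SET zipcode = '54321'
    WHERE street = 'Main St' AND city = 'Springfield';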

That's not a very high cost, and not a very likely event.

On the other hand, how likely is it that the Post Office will take a single street, and split that street between two zip codes, depending on which block in the street the address is on? If this latter event happens, you're actually better off with the structure that violates 3NF. You are free to enter different zip codes for each address, using the information the Post Office gave about the split.

So, how likely is this second scenario? I think it's more likely than the first. But you need to go with your guess, and not mine.

answered Oct 22 '22 by Walter Mitty