Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

At what point does data normalization become ludicrous?

I often find myself questioning whether I'm taking the right approach in trying to plan for future expansibility when creating databases and relations.

I have the following situation:

  1. I have a Donor table and a Recipient table. Both tables share common information such as first_name, last_name, email_address, date_of_birth, etc. Both seem to, if you'll excuse my object-oriented language, share a common abstract type of Person. It's possible that someone who is at one point a Recipient may later become a Donor by means of giving a donation, so it's important that information isn't duplicated across tables. Should I opt for an inheritance pattern, or should I just foreign key Donors and Recipients to a Person table?

  2. Initially, I was thinking of simply mapping properties like email_address and street address properties directly into the things that need them, but then the possibility may arise that a person would have multiple email addresses or mailing addresses (ie: home, work, etc.). What that means is that we have a model somewhat like this:

    create table person(id int primary key auto increment, ..., 
        default_email_address);
    
    create table email_address(id int primary key auto increment, 
        email varchar(255), name varchar(255), is_default bool, person_id int);
    

    This makes things a bit complicated, as you can imagine. The name field also involves a list of default values as well as allowing custom input. I can't just make it an enum field, because the possibility exists that someone will have a lot of emails to add that could all be different... (this is the point at which I scream out "IS IT EVEN WORTH IT ANYMORE!?!?" and get frustrated with the project)

I guess what this really boils down to is the following: at what point does data normalization become ludicrous? My goal here is to create a really good as-forward-compatible-as-possible data model that I won't kick myself for creating later.

like image 271
Naftuli Kay Avatar asked Mar 29 '12 18:03

Naftuli Kay


3 Answers

at what point does data normalization become ludicrous?

At the point that it stops modelling the actual requirements.

To take your examples:

  • With the Donor and Recipient tables, if it is highly likely that any one person will become both, then it does make sense to separate out to a Person entity. If this is rare, it doesn't.

  • With the email_address and street_address situations, it depends whether you do need to store multiples or not (what is the expectation?). You may want to store separate versions per business unit (say shipping_address vs billing_address).

like image 157
Oded Avatar answered Oct 17 '22 18:10

Oded


I think the problem is not in your implementation, but rather in your analysis of the problem. Donor and Recipient are not first-class actors, they are roles of the actors. If you model them as such, you'd get a somewhat cleaner model:

  • You'd have a person table with addresses and so on
  • You'd also have an address table with addresses of the people
  • You'd also have a person_role table, with the role code (donor, recipient) and other relevant information. You may want to get fancy, and add person_donor and person_recipient, with a foreign key into the person table.
like image 34
Sergey Kalinichenko Avatar answered Oct 17 '22 19:10

Sergey Kalinichenko


Short answer: Normalization never becomes ridiculous. Most of what you're doing isn't normalization.

Longer answer

The "worst" (in truth, the "best) most designers can practically do is end up with all tables in 5NF. 5NF isn't ridiculous at all. (Yes, I know about 6NF. I'm ignoring it for didactic reasons.)

questioning whether I'm taking the right approach in trying to plan for future expansibility

That's a good question to ask yourself. It has nothing to do with normalization, though. At the conceptual level, normalization is something you do after you've decided what attributes (columns) and data need to go into your database. Experienced database designers often "think in 3NF", choosing attributes, data, and normalizing all at the same time, more or less.

Should I opt for an inheritance pattern, or should I just foreign key Donors and Recipients to a Person table?

Donors and recipients aren't different types of people. Donors are people who have made a donation. Recipients are people who have received something.

id fullname       don_date    don_amt  recip_date  recip_amt
--
1  Jamie Hubbert  2012-01-13  $20.00
1  Jamie Hubbert  2012-02-13  $17.00
2  Kelly Hawkin   2012-01-13  $50.00
2  Kelly Hawkin   2012-01-13  $20.00
3  Neva Papke                          2012-01-13  $15.00
3  Neva Papke                          2012-02-13  $15.00
2  Kelly Hawkin                        2012-01-13  $10.00
4  Jamie Hubbert  2012-01-13  $10.00   
4  Jamie Hubbert                       2012-02-13  $10.00   

During normalization, you'd identify these dependencies. (For simplicity, assumes one donation per person per date.)

  • person_id -> person_name
  • person_id -> email
  • person_id, donation_date -> donation_amount
  • person_id, recip_date -> recip_amount

Normalize to 5NF, and you'd get these three tables.

Persons
--
1  Jamie Hubbert
2  Kelly Hawkin
3  Neva Papke  
4  Jamie Hubbert

Donations
--
1  2012-01-13  $20.00
1  2012-02-13  $17.00
2  2012-01-13  $50.00
2  2012-01-13  $20.00
4  2012-01-13  $10.00   

Receipts (?)
--
3  2012-01-13  $15.00
3  2012-02-13  $15.00
2  2012-01-13  $10.00
4  2012-02-13  $10.00   

Initially, I was thinking of simply mapping properties like email_address and street address properties directly into the things that need them, but then the possibility may arise that a person would have multiple email addresses or mailing addresses (ie: home, work, etc.).

Deciding whether to support multiple email addresses, multiple mailing addresses, and different mailing and delivery addresses is a significant design decision. But it has nothing to do with normalization. Normalization, again, is something you do after you've decided which attributes and data belong in your database. So, if you were collecting representative sample data, you might end up with one of these two sets of email addresses.

Set A
1  Jamie Hubbert  [email protected]
4  Jamie Hubbert  [email protected]

Set B
1  Jamie Hubbert  [email protected]
1  Jamie Hubbert  [email protected]
4  Jamie Hubbert  [email protected]

In set A, person_id->email. In set B, it doesn't. Choosing to support the data in set A or the data in set B is a big decision, and it strongly affects what you end up with after normalizing to 5NF. But deciding which set to support has nothing to do with normalization.

As an aside, choosing to assign id numbers to non-unique email addresses is another big (and questionable) design decision. Like others, this decision has nothing to do with normalization.

(Random names courtesy of The Random Name generator.)

like image 2
Mike Sherrill 'Cat Recall' Avatar answered Oct 17 '22 17:10

Mike Sherrill 'Cat Recall'