Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pros and cons of making database IDs consistent and "readable" [closed]

Question

Is it a good rule of thumb for database IDs to be "meaningless?" Conversely, are there significant benefits from having IDs structured in a way where they can be recognized at a glance? What are the pros and cons?

Background

I just had a debate with my coworkers about the consistency of the IDs in our database. We have a data-driven application that leverages spring so that we rarely ever have to change code. That means, if there's a problem, a data change is usually the solution.

My argument was that by making IDs consistent and readable, we save ourselves significant time and headaches, long term. Once the IDs are set, they don't have to change often and if done right, future changes won't be difficult. My coworkers position was that IDs should never matter. Encoding information into the ID violates DB design policies and keeping them orderly requires extra work that, "we don't have time for." I can't find anything online to support either position. So I'm turning to all the gurus here at SA!

Example

Imagine this simplified list of database records representing food in a grocery store, the first set represents data that has meaning encoded in the IDs, while the second does not:


ID's with meaning:

Type
1 Fruit
2 Veggie

Product
101 Apple
102 Banana
103 Orange
201 Lettuce
202 Onion
203 Carrot

Location
41 Aisle four top shelf
42 Aisle four bottom shelf
51 Aisle five top shelf
52 Aisle five bottom shelf

ProductLocation
10141 Apple on aisle four top shelf
10241 Banana on aisle four top shelf
//just by reading the ids, it's easy to recongnize that these are both Fruit on Aisle 4

ID's without meaning:

Type
1 Fruit
2 Veggie

Product
1 Apple
2 Banana
3 Orange
4 Lettuce
5 Onion
6 Carrot

Location
1 Aisle four top shelf
2 Aisle four bottom shelf
3 Aisle five top shelf
4 Aisle five bottom shelf

ProductLocation
1 Apple on aisle four top shelf
2 Banana on aisle four top shelf
//given the IDs, it's harder to see that these are both fruit on aisle 4

Summary

What are the pros and cons of keeping IDs readable and consistent? Which approach do you generally prefer and why? Is there an accepted industry best-practice?

-------- edit ( helpful background info from comments, below ): --------

In our tables, the Primary Key is always an ID field containing a unique integer. At first, that integer was arbitrary. Over time, some of these IDs naturally took on meaning among developers/testers. During a recent refactor, certain developers also took time to make all IDs easier to recognize. It made everyone's job 100X easier. Some people (who don't actually use the data/code) vehemently disagreed for theoretical reasons. In practice, not one of those objections are holding true. Moreover, all developers using the data agree that it's now significantly easier to maintain.

I'm looking for (but haven't seen) a defensible argument against using immediately recognizable IDs in a data-centric environment.

like image 414
gMale Avatar asked Feb 09 '11 22:02

gMale


6 Answers

Con: I just changed "Aisle Five top shelf" to "Aisle Six top shelf" so now I have to change its ID to 61 and now I have to change the ProductLocation ID for "Grapes on Aisle five top shelf" to be 10461 and oh god where else does the shelf location ID string show up in IDs in my database oh god whoever designed IDs to carry meaning ought to be shot it's four in the morning and everything's gone crazy and why does "Aisle seven bottom shelf" have an ID of 41 die die die.

like image 144
CanSpice Avatar answered Nov 11 '22 13:11

CanSpice


There are several problems with using database IDs to encode information about a row. If you want your carrots to have an "ID" of 203, you should add a product_id column (for example) and put this information there instead. Why?

  1. By customizing your IDs, you have to add domain-specific code that manages your IDs and you can't rely on database features like auto-incrementing or UUIDs.
  2. If you ever have to change your classification, it will mess up your table relations, browser bookmarks, search engine results, etc.
  3. It's not common practice -- so when you put application- or domain-specific data into the ID field, it will be assumed by many that this is meaningless information, when it isn't. You will need a data dictionary (and you will have to ensure that people read the data dictionary) to note the fact that this is valuable information.

The only required purpose of an ID is to uniquely identify a row within a table. If it can provide good lookup performance, that's a bonus, and if it can be compactly stored, that's another bonus. But it shouldn't contain any information about the entity in the row it identifies, other than the unique identifier of that entity.

like image 34
alexantd Avatar answered Nov 11 '22 13:11

alexantd


Well, given your 10141 "Apple is in aisle four", what happens when you end up with product 10 in aisle 1 on shelf 41? Or is that product 1 in aisle 014 on shelf 1, or is it product 101 in aisle 41 sitting on the floor because it's not on a shelf?

Once you start co-mingling data like that, you generally lose any ability to reliably extract the components. Human-readable keys are all nice and dandy, but you never destroy the individual IDs the human form is based on.

like image 40
Marc B Avatar answered Nov 11 '22 14:11

Marc B


What do you mean by "readable"? IDs are typically just numbers. And what do you mean by "consistent"? IDs are typically just incrementing numbers; you can't get much more consistent than that. Why waste time and effort trying to encode information into the ID, when the information will already be present explicitly in the database? Who would make use of "orderly" IDs?

like image 22
Oliver Charlesworth Avatar answered Nov 11 '22 14:11

Oliver Charlesworth


Here is my take on Surrogate keys. (or ID's if you want to call them that)

Surrogate keys have no business meaning. They are used to uniquely identify the row. But they do more than merely identify the row. They are also the "soul" of the row. It can't be changed or traded. If the surrogate follows the "soul" principle then when you delete the row, a new row will never take the dead row's value. The soul still belongs to the deleted row even after it's dead and gone.

I like my surrogates to be "souls", though that is not required to be a surrogate.

The advantage of a surrogate is it never needs to change. If 30 other tables have a foriegn key to your main table, you don't want to update all 30 when the main table's PK changes. You can still have a CANDIDATE key on that potentially changing value, but since it can change it is not the row's soul.

Surrogate keys are often auto-increment integers. This lends itself PERFECTLY for clustered indexes. Your table joins will be as good as they can possibly be. Natural keys tend to make horrible clustered indexes as new values are rarely sequential. Integers are small, fixed length data types for even faster matching.

If your name changes, you're still you. If you burn off your finger prints, you're still you. God is using a surrogate key, so I think it's OK for us to use them in our databases.

EDIT After reading your question more carefully, I think you are actually using "meaningless keys" just in the wrong way.

You have value "10141" to represent the apple/location association. That is the combination of 2 surrogates into 1 field. Keep them as separate fields "101" and "41" and make the PK on the combo of those fields. Keeping them separate will make it easier to search, index, table join, etc.

You are right, you don't need yet another surrogate on the mapping table. The combo of 2 surrogates is a surrogate in it's own right (though not a soul). Just express the combo in 2 separate columns, not combined into 1 column. END EDIT

like image 41
Lord Tydus Avatar answered Nov 11 '22 13:11

Lord Tydus


Meaningful ids are not against "db design policies"!

Quite the opposite, it's exactly what the real relational databases were about from the day one. If your data contains some combination of attributes that is - from the business point of view - unique, NOT making it an ID will usually break the Boyce-Codd normal form. And bring the anomalies that go with it.

Unless the information encoded in ID is redundant with what's in other fields, just use it. If it is redundant, make a multiple column primary key. They are not very handy with ORMs, but in data driven applications they are a blessing.

ADDENDUM: (after the edit of original question)

In your case, for a data driven application, I would do:

Type
==========
Fruit
Veggie

Product
==========
Apple    Fruit
Banana   Fruit
Orange   Fruit
Lettuce  Veggie
Onion    Veggie
Carrot   Veggie

Isle
==========
4
5

Shelf
==========
top
bottom

Location
==========
4   top
4   bottom
5   top
5   bottom

ProductLocation
==========
Apple    4  top
Banana   4  top

With such setup:

  • the data is normalized
  • you can SEE the location of any product in the ProductLocation table - you can even see the shelf
  • no surrogates
  • depending on types of queries, this structure can actually perform better than other propositions, because it requires less joins (or it may be slower, because it requires more storage).
  • this will work best with RDBMSs that support an "on replace update" constraint.
  • if you want to treat names as ids, you probably need to add some column like 'display name". That's because people want to change what's displayed more often than they want to change identity of thigns.
like image 28
fdreger Avatar answered Nov 11 '22 13:11

fdreger