Question
Is it a good rule of thumb for database IDs to be "meaningless?" Conversely, are there significant benefits from having IDs structured in a way where they can be recognized at a glance? What are the pros and cons?
Background
I just had a debate with my coworkers about the consistency of the IDs in our database. We have a data-driven application that leverages spring so that we rarely ever have to change code. That means, if there's a problem, a data change is usually the solution.
My argument was that by making IDs consistent and readable, we save ourselves significant time and headaches in the long term. Once the IDs are set, they don't have to change often, and if done right, future changes won't be difficult. My coworkers' position was that IDs should never matter. Encoding information into the ID violates DB design policies, and keeping them orderly requires extra work that "we don't have time for." I can't find anything online to support either position. So I'm turning to all the gurus here at SA!
Example
Imagine this simplified list of database records representing food in a grocery store, the first set represents data that has meaning encoded in the IDs, while the second does not:
IDs with meaning:
Type
1 Fruit
2 Veggie
Product
101 Apple
102 Banana
103 Orange
201 Lettuce
202 Onion
203 Carrot
Location
41 Aisle four top shelf
42 Aisle four bottom shelf
51 Aisle five top shelf
52 Aisle five bottom shelf
ProductLocation
10141 Apple on aisle four top shelf
10241 Banana on aisle four top shelf
//just by reading the IDs, it's easy to recognize that these are both Fruit on Aisle 4
IDs without meaning:
Type
1 Fruit
2 Veggie
Product
1 Apple
2 Banana
3 Orange
4 Lettuce
5 Onion
6 Carrot
Location
1 Aisle four top shelf
2 Aisle four bottom shelf
3 Aisle five top shelf
4 Aisle five bottom shelf
ProductLocation
1 Apple on aisle four top shelf
2 Banana on aisle four top shelf
//given the IDs, it's harder to see that these are both fruit on aisle 4
Summary
What are the pros and cons of keeping IDs readable and consistent? Which approach do you generally prefer and why? Is there an accepted industry best-practice?
-------- edit ( helpful background info from comments, below ): --------
In our tables, the Primary Key is always an ID field containing a unique integer. At first, that integer was arbitrary. Over time, some of these IDs naturally took on meaning among developers/testers. During a recent refactor, certain developers also took time to make all IDs easier to recognize. It made everyone's job 100X easier. Some people (who don't actually use the data/code) vehemently disagreed for theoretical reasons. In practice, not one of those objections is holding true. Moreover, all developers using the data agree that it's now significantly easier to maintain.
I'm looking for (but haven't seen) a defensible argument against using immediately recognizable IDs in a data-centric environment.
Con: I just changed "Aisle Five top shelf" to "Aisle Six top shelf" so now I have to change its ID to 61 and now I have to change the ProductLocation ID for "Grapes on Aisle five top shelf" to be 10461 and oh god where else does the shelf location ID string show up in IDs in my database oh god whoever designed IDs to carry meaning ought to be shot it's four in the morning and everything's gone crazy and why does "Aisle seven bottom shelf" have an ID of 41 die die die.
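To make that cascade concrete, here is a minimal sketch in Python. It assumes the question's encoding, where the last two digits of a ProductLocation ID are the location ID; the `renumber_location` helper and the sample rows are hypothetical and only illustrate the point.

```python
# Hypothetical ProductLocation rows keyed by the encoded ID
# (product ID followed by a two-digit location ID, as in the question).
product_locations = {
    10141: "Apple on aisle four top shelf",
    10241: "Banana on aisle four top shelf",
    10451: "Grapes on aisle five top shelf",
}


def renumber_location(rows, old_loc, new_loc):
    """Rewrite every encoded key that embeds old_loc; return (rows, count)."""
    updated = {}
    changed = 0
    for key, description in rows.items():
        # Assumption: the location always occupies the last two digits.
        product, loc = divmod(key, 100)
        if loc == old_loc:
            key = product * 100 + new_loc
            changed += 1
        updated[key] = description
    return updated, changed


# Location 51 becomes 61, so every key that embeds 51 must be rewritten too.
updated, changed = renumber_location(product_locations, 51, 61)
```

Every other table that stores one of the rewritten keys would need the same treatment, which is the maintenance cost this answer is complaining about.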
There are several problems with using database IDs to encode information about a row. If you want your carrots to have an "ID" of 203, you should add a product_id column (for example) and put this information there instead. Why?
The only required purpose of an ID is to uniquely identify a row within a table. If it can provide good lookup performance, that's a bonus, and if it can be compactly stored, that's another bonus. But it shouldn't contain any information about the entity in the row it identifies, other than the unique identifier of that entity.
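A rough sketch of that separation, using Python's built-in sqlite3 module. The table layout and the column name `product_code` are assumptions made for illustration (the answer suggests a column such as product_id; a different name is used here only to avoid confusion with the foreign key).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product (
    id           INTEGER PRIMARY KEY,      -- meaningless row identifier
    product_code INTEGER UNIQUE NOT NULL,  -- readable code, free to change
    name         TEXT NOT NULL
);
CREATE TABLE product_location (
    product_id INTEGER NOT NULL REFERENCES product(id),
    location   TEXT NOT NULL
);
""")

conn.execute("INSERT INTO product (product_code, name) VALUES (203, 'Carrot')")
carrot_id = conn.execute(
    "SELECT id FROM product WHERE product_code = 203"
).fetchone()[0]
conn.execute(
    "INSERT INTO product_location VALUES (?, ?)",
    (carrot_id, "Aisle five top shelf"),
)

# Renumbering the readable code touches exactly one row;
# the foreign key in product_location is unaffected.
conn.execute("UPDATE product SET product_code = 303 WHERE id = ?", (carrot_id,))

row = conn.execute("""
    SELECT p.product_code, p.name, pl.location
    FROM product_location AS pl
    JOIN product AS p ON p.id = pl.product_id
""").fetchone()
```

The readable number is still there for people to use, but nothing else in the schema depends on it.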
Well, given your 10141 "Apple is in aisle four", what happens when you end up with product 10 in aisle 1 on shelf 41? Or is that product 1 in aisle 014 on shelf 1, or is it product 101 in aisle 41 sitting on the floor because it's not on a shelf?
Once you start co-mingling data like that, you generally lose any ability to reliably extract the components. Human-readable keys are all nice and dandy, but you should never destroy the individual IDs the human-readable form is based on.
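The ambiguity can be shown by enumerating every way to cut the concatenated ID into three fields. This is only an illustrative sketch; the `possible_splits` helper is hypothetical and assumes nothing about field widths, which is exactly the problem once widths stop being fixed.

```python
from itertools import combinations


def possible_splits(encoded, parts=3):
    """Return every way to cut `encoded` into `parts` non-empty pieces."""
    results = []
    for cuts in combinations(range(1, len(encoded)), parts - 1):
        bounds = (0, *cuts, len(encoded))
        pieces = tuple(encoded[a:b] for a, b in zip(bounds, bounds[1:]))
        results.append(pieces)
    return results


# Each tuple is a candidate (product, aisle, shelf) reading of the same ID.
splits = possible_splits("10141")
```

The list contains the intended reading ("101", "4", "1") alongside ("10", "1", "41") and ("1", "014", "1"), and nothing in the number itself says which one is right.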
What do you mean by "readable"? IDs are typically just numbers. And what do you mean by "consistent"? IDs are typically just incrementing numbers; you can't get much more consistent than that. Why waste time and effort trying to encode information into the ID, when the information will already be present explicitly in the database? Who would make use of "orderly" IDs?
Here is my take on Surrogate keys. (or ID's if you want to call them that)
Surrogate keys have no business meaning. They are used to uniquely identify the row. But they do more than merely identify the row. They are also the "soul" of the row. It can't be changed or traded. If the surrogate follows the "soul" principle then when you delete the row, a new row will never take the dead row's value. The soul still belongs to the deleted row even after it's dead and gone.
I like my surrogates to be "souls", though that is not required to be a surrogate.
The advantage of a surrogate is that it never needs to change. If 30 other tables have a foreign key to your main table, you don't want to update all 30 when the main table's PK changes. You can still have a CANDIDATE key on that potentially changing value, but since it can change, it is not the row's soul.
Surrogate keys are often auto-increment integers. This lends itself PERFECTLY for clustered indexes. Your table joins will be as good as they can possibly be. Natural keys tend to make horrible clustered indexes as new values are rarely sequential. Integers are small, fixed length data types for even faster matching.
If your name changes, you're still you. If you burn off your fingerprints, you're still you. God is using a surrogate key, so I think it's OK for us to use them in our databases.
EDIT After reading your question more carefully, I think you are actually using "meaningless keys" just in the wrong way.
You have value "10141" to represent the apple/location association. That is the combination of 2 surrogates into 1 field. Keep them as separate fields "101" and "41" and make the PK on the combo of those fields. Keeping them separate will make it easier to search, index, table join, etc.
You are right, you don't need yet another surrogate on the mapping table. The combo of 2 surrogates is a surrogate in its own right (though not a soul). Just express the combo in 2 separate columns, not combined into 1 column. END EDIT
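As a minimal sketch of the two-column version, here is a mapping table with a composite primary key, again using Python's sqlite3. The table and column names are assumptions for illustration; the IDs are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_location (
    product_id  INTEGER NOT NULL,
    location_id INTEGER NOT NULL,
    PRIMARY KEY (product_id, location_id)  -- the pair is the key
);
INSERT INTO product_location VALUES (101, 41), (102, 41), (201, 51);
""")

# "Everything at location 41" is a plain column filter, no digit parsing.
on_41 = [
    r[0]
    for r in conn.execute(
        "SELECT product_id FROM product_location "
        "WHERE location_id = 41 ORDER BY product_id"
    )
]

# The composite key still rejects a duplicate pairing.
try:
    conn.execute("INSERT INTO product_location VALUES (101, 41)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Anyone reading a row still sees 101 and 41 side by side, so the at-a-glance readability of 10141 is kept without fusing the two values.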
Meaningful ids are not against "db design policies"!
Quite the opposite, it's exactly what real relational databases were about from day one. If your data contains some combination of attributes that is unique from the business point of view, NOT making it an ID will usually break the Boyce-Codd normal form and bring the anomalies that go with it.
Unless the information encoded in ID is redundant with what's in other fields, just use it. If it is redundant, make a multiple column primary key. They are not very handy with ORMs, but in data driven applications they are a blessing.
ADDENDUM: (after the edit of original question)
In your case, for a data driven application, I would do:
Type
==========
Fruit
Veggie
Product
==========
Apple Fruit
Banana Fruit
Orange Fruit
Lettuce Veggie
Onion Veggie
Carrot Veggie
Aisle
==========
4
5
Shelf
==========
top
bottom
Location
==========
4 top
4 bottom
5 top
5 bottom
ProductLocation
==========
Apple 4 top
Banana 4 top
With such setup: