Pros. Using UUID for a primary key brings the following advantages: UUID values are unique across tables, databases, and even servers that allow you to merge rows from different databases or distribute databases across servers. UUID values do not expose the information about your data so they are safer to use in a URL.
Primary keys should never be exposed, even UUIDsA primary key is, by definition unique within its scope. It is, therefore, an obvious thing to use as a customer number, or in a URL to identify a unique page or row. Don't! I would argue that using a PK in any public context is a bad idea.
You can use UUID as primary key in your table as it will be unique. However do keep in mind that UUID will occupy a bit more space as compared to SEQUENCE. And also they are not very fast. But yes they are for sure unique and hence you are guaranteed to get a consistent data.
If your database is or will eventually be distributed, like in the case of a local-first application, or simply if your NoSQL database is scaling up and divided upon multiple servers, I'd say that you have almost non choice : Use UUID! Just know that there is some things that you can do to improve performance.
UUIDs
Unless these are generated "in increasing monotonic sequence" they can drastically hurt/fragment indexes. Support for UUID generation varies by system. While usable, I would not use a UUID as my primary clustered index/PK in most cases. If needed I would likely make it a secondary column, perhaps indexed, perhaps not.
Some people argue that UUIDs can be used to safely generate/merge records from an arbitrary number of systems. While a UUID (depending upon method) generally has an astronomically small chance of collision, it is possible to -- at least with some outside input or very bad luck :) -- generate collisions. I am of the belief that only a true PK should be transmitted between systems, which I would argue is not (or should not be) a database-generated UUID in most cases.
autoincrement/sequence keys and sequence tables
This really depends on what the database supports well. Some databases support sequences which are more flexible that a simple "auto-increment". This may or may not be desirable (or may be the only way for this kind of task simply, even). Sequence tables are generally more flexible yet, but if this kind of "flexibility" is needed I would be tempted to go back and visit the design-pattern, especially if it involves the use of triggers. While I dislike "limiting ORMs", that may also make a difference in choosing the "simpler" auto-increment or sequence types/database support.
Regardless of the method used, when using surrogate primary keys, the true primary key should still be identified and encoded into the schema.
In addition, I argue that "security compromises through exposing an auto-sequence PK" are a result of incorrectly exposing an internal database property. While a very simple way to handle CRUD operation, I believe there is a distinction between the internal keys and the exposed keys (e.g. pretty customer number).
Just my two cents.
Edit, additional replies to Tim:
I think the generated vs. true PK question is a very good one and one I need to consider also. I'd like UUIDs in general to the points you make. My hesitation was in size vs. an int/long. Was not aware of potential indexing de-optimizations, which is a much bigger concern for me.
I wouldn't really worry about the size -- if a UUID is best, then it's best. If it's not, then it's not. In the overall scheme the extra 12bytes over an int likely won't make much of a difference. SQL Server 2005+ supports the newsequentialid UUID generation function to avoid the fragmentation associated with normal UUID generation. The page discusses it some. I am sure that other databases have similar solutions.
And by "encoded into the schema", do you mean more than adding a uniqueness constraint?
Yes. The primary key doesn't have to be the only [unique] constraint. Just using a surrogate PK doesn't mean the database model should be compromised :-) Additional indexes can also be used to cover, etc.
And by "distinction between", are you saying that surrogate primary keys never leak out?
The wording in my initial post was a tad hard. It's not "never" so much as "if they do and it matters then that's another problem". Often times people complain of insecurity through guessable numbers -- e.g. if your order is 23 then there is likely an order 22 and 24, etc. If this is your "protection" and/or can leak sensitive information then the system is already flawed. (Separating internal and external ids does not inherently fix this issue and authentication/authorization is still required. However, it is one issue raised against using "sequential ids" -- I find encoding a nonce into distributed URLs handles this for my use-case rather well.)
More to what I really wanted to get across: Just because the surrogate PK id happens to be 8942 doesn't mean that it's order 8942. That is, keeping with the "some fields are internal only to db" design, the order "number" might be entirely unrelated on the surface (but fully supported in the DB model), such as "#2010-42c" or whatever makes sense for the business requirement(s). It is this external number that should be exposed in most cases.
I feel that sometimes the generated key is really the true primary key as other fields are mutable (eg. user may change email and username).
This may be the case within a database and I will not argue this statement. However, once again holding that the surrogate PK's are internal to the database, just make sure to only export/import tuples that can be well-identified. If the username/email may change, then this might very well include a UUID assigned upon account creation -- and could very well be the surrogate PK itself.
Of course, as with everything, remain open and fit the model to the problem, not the problem to the model :-) For a service like twitter, for instance, they use their own number generation schema. See Twitter's new ID generation. Unlike [some] UUID generation, the approach by twitter (assuming that all the servers are correctly setup) guarantees that none of the distributed machines/processes will ever generate a duplicate ID, requires only 64-bits, and maintains rough ordering (most significant bits are time-stamp). (The number of records generated by twitter may be in no way related to local requirements ;-)
Happy coding.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With