Human readable alternative for UUIDs

Tags: uuid, standards, human-readable

I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:

1. They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
2. It should be easy to create unique pseudonyms.
3. They should be human readable. That means they should be easy for humans to compare, copy, and understand when read aloud.

My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).

A variant is to encode UUIDs with a larger alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.

Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.

Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.

So my question is: Is there any widely-used standard for human-readable unique ids?


asked Mar 27 '18 07:03

tobib

2 Answers

I'm not aware of any widely used standard for this. Here’s one that is not widely used:

Proquints

https://arxiv.org/html/0901.4016

https://github.com/dsw/proquint

A UUID4 (128 bits) converts to 8 proquints, since each proquint encodes 16 bits. If that’s too long, you can take the last 64 bits of the UUID4, or simply generate 64 random bits directly. Shortening the identifier doesn’t make it magically lose uniqueness; it only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide whether it’s still OK for your purposes.
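As a rough illustration of the encoding, here is a minimal Python sketch based on my reading of the proquint proposal: each 16-bit word becomes five letters (consonant, vowel, consonant, vowel, consonant), with consonants carrying 4 bits and vowels 2 bits. The function names `word_to_proquint`, `bytes_to_proquints`, and `new_pseudonym` are made up for this example and are not part of the reference implementation linked above.

```python
import secrets

CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def word_to_proquint(word: int) -> str:
    """Encode one 16-bit word as consonant-vowel-consonant-vowel-consonant."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def bytes_to_proquints(data: bytes) -> str:
    """Encode an even number of bytes as dash-separated proquints."""
    if len(data) % 2:
        raise ValueError("need an even number of bytes")
    words = (
        int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)
    )
    return "-".join(word_to_proquint(w) for w in words)


def new_pseudonym(n_bits: int = 64) -> str:
    """Hypothetical helper: draw n_bits random bits and encode them."""
    if n_bits % 16:
        raise ValueError("n_bits must be a multiple of 16")
    return bytes_to_proquints(secrets.token_bytes(n_bits // 8))
```

With this sketch, the four bytes of 127.0.0.1 encode to `lusab-babad`, which matches the example given in the proposal. `new_pseudonym()` yields four proquints for 64 bits, and `bytes_to_proquints(uuid.uuid4().bytes)` would yield eight.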


answered Sep 18 '22 23:09

Vasiliy Faronov

This article suggests using the first few characters of a SHA-256 hash, similar to what git does. Hash-based UUIDs (version 5) are derived from SHA-1, so this is not all that different; version 4 UUIDs, by contrast, are purely random. The tradeoff between properties (2) and (3) lies in the number of characters.

With d being the number of hexadecimal digits, you get 2 ** (4 * d) possible identifiers in total, but by the birthday bound the first collision is expected after roughly 2 ** (2 * d) identifiers have been generated.
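To put numbers on this, here is a small Python sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / (2N)), where N is the size of the identifier space and n the number of identifiers drawn. It assumes identifiers are drawn uniformly at random; the function names are only illustrative.

```python
import math


def total_ids(d: int) -> int:
    """Number of distinct identifiers with d hexadecimal digits."""
    return 2 ** (4 * d)


def collision_probability(n: int, d: int) -> float:
    """Approximate probability of at least one collision among n
    uniformly random d-digit hex identifiers (birthday approximation)."""
    space = total_ids(d)
    if n > space:
        return 1.0  # pigeonhole: a collision is certain
    exponent = n * (n - 1) / (2 * space)
    return -math.expm1(-exponent)  # equals 1 - exp(-exponent)
```

For d = 8 (32 bits), drawing 2 ** 16 identifiers already gives a collision probability of roughly 39%, which is why the 2 ** (2 * d) figure is a useful rule of thumb for when collisions start to matter.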

The big question is really not which kind of identifier you use, but how you handle collisions.
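One possible way to handle collisions, sketched below in Python, is to check each new candidate against the identifiers already issued and retry on a clash. This is only an assumption about how such a system might work: `issue_pseudonym` is a hypothetical helper, and the in-memory set stands in for whatever store (for example a database column with a unique constraint) would hold the issued pseudonyms in practice.

```python
import secrets


def issue_pseudonym(issued: set, n_hex_digits: int = 8,
                    max_attempts: int = 100) -> str:
    """Draw random hex identifiers until one is unused, then record it.

    `issued` is mutated in place. Raises RuntimeError if no free
    identifier is found within max_attempts draws.
    """
    n_bytes = (n_hex_digits + 1) // 2
    for _ in range(max_attempts):
        # token_hex returns 2 * n_bytes digits; trim for odd lengths
        candidate = secrets.token_hex(n_bytes)[:n_hex_digits]
        if candidate not in issued:
            issued.add(candidate)
            return candidate
    raise RuntimeError("identifier space appears to be exhausted")
```

With this approach short identifiers stay unique for as long as free values remain, at the cost of having to keep the set of issued identifiers available wherever new ones are created.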

answered Sep 19 '22 23:09

tobib
