Human readable alternative for UUIDs

I am working on a system that makes heavy use of pseudonyms to make privacy-critical data available to researchers. These pseudonyms should have the following properties:

  1. They should not contain any information (e.g. time of creation, relation to other pseudonyms, encoded data, …).
  2. It should be easy to create unique pseudonyms.
  3. They should be human readable. That means they should be easy for humans to compare, copy, and understand when read aloud.

My first idea was to use UUID4. They are quite good on (1) and (2), but not so much on (3).

A variant is to encode UUIDs with a wider alphabet, resulting in shorter strings (see for example shortuuid). But I am not sure whether this actually improves readability.

Another approach I am currently looking into is a paper from 2005 titled "An optimal code for patient identifiers" which aims to tackle exactly my problem. The algorithm described there creates 8-character pseudonyms with 30 bits of entropy. I would prefer to use a more widely reviewed standard though.

Then there is also the git approach: only display the first few characters of the actual pseudonym. But this would mean that a pseudonym could lose its uniqueness after some time.

So my question is: Is there any widely-used standard for human-readable unique ids?

tobib asked Mar 27 '18 07:03


2 Answers

Not aware of any widely-used standard for this. Here’s a non-widely-used one:

Proquints

https://arxiv.org/html/0901.4016

https://github.com/dsw/proquint

A UUID4 (128 bits) would be converted into 8 proquints, since each proquint encodes 16 bits. If that’s too much, you can take the last 64 bits of the UUID4 (= just take 64 random bits), which gives 4 proquints. This doesn’t make it magically lose uniqueness; it only increases the likelihood of collisions, which was non-zero to begin with, and which you can estimate mathematically to decide if it’s still OK for your purposes.
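As a rough illustration of the encoding, here is a minimal Python sketch that follows the proquint proposal linked above (16 consonants and 4 vowels, one consonant-vowel-consonant-vowel-consonant word per 16 bits). The function names `word_to_proquint`, `bytes_to_proquints`, and `random_pseudonym` are made up for this example and are not part of any library; the reference implementation is in the linked repository.

```python
import secrets
import uuid

# Alphabets from the proquint proposal: 4 bits per consonant, 2 bits per vowel.
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "aiou"


def word_to_proquint(word: int) -> str:
    """Encode one 16-bit value as a 5-letter proquint (con-vow-con-vow-con)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def bytes_to_proquints(data: bytes, sep: str = "-") -> str:
    """Encode an even number of bytes, two bytes (big-endian) per proquint."""
    if len(data) % 2:
        raise ValueError("need an even number of bytes")
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return sep.join(word_to_proquint(w) for w in words)


def random_pseudonym(bits: int = 64) -> str:
    """Hypothetical helper: a pseudonym built from `bits` random bits."""
    if bits % 16:
        raise ValueError("bits must be a multiple of 16")
    return bytes_to_proquints(secrets.token_bytes(bits // 8))


# The proposal's own example: the IPv4 address 127.0.0.1.
print(bytes_to_proquints(bytes([127, 0, 0, 1])))  # lusab-babad
print(bytes_to_proquints(uuid.uuid4().bytes))     # 8 proquints, random
print(random_pseudonym(64))                       # 4 proquints, random
```

Note that passing all 16 bytes of a UUID4 also encodes its fixed version and variant bits; drawing the random bits directly, as `random_pseudonym` does, avoids that.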

Vasiliy Faronov answered Sep 18 '22 23:09


This article suggests using the first few characters of a SHA-256 hash, similar to what git does. Only UUID version 5 is based on SHA-1 (version 4 is purely random), but in both cases you end up with a string of effectively random bits, so this is not all that different. The tradeoff between properties (2) and (3) is in the number of characters.

With d being the number of hexadecimal digits (4 bits each), you get 2 ** (4 * d) identifiers in total, but because of the birthday paradox the first collision is expected to happen after roughly 2 ** (2 * d) identifiers have been created.
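To put numbers on this, here is a small Python sketch. It assumes identifiers are drawn uniformly at random and uses the standard birthday approximation p ≈ 1 - exp(-n(n-1) / (2N)); the function names are made up for this example. The 2 ** (2 * d) figure is an order-of-magnitude estimate (the exact expectation is about 1.25 times that).

```python
import math


def total_ids(d: int) -> int:
    """Number of distinct identifiers with d hexadecimal digits."""
    return 2 ** (4 * d)


def rough_first_collision(d: int) -> int:
    """Order-of-magnitude count at which a first collision becomes likely."""
    return 2 ** (2 * d)


def collision_probability(n: int, d: int) -> float:
    """Approximate probability of at least one collision among n random ids."""
    return -math.expm1(-n * (n - 1) / (2 * total_ids(d)))


for d in (6, 8, 12, 16):
    n = rough_first_collision(d)
    p = collision_probability(n, d)
    print(f"d={d:2d}: {total_ids(d):.3e} ids, p(collision) at {n} ids = {p:.2f}")
```

With 8 digits, for example, there are about 4.3 billion possible identifiers, but the chance of at least one collision is already close to 40% after 65,536 of them.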

The big question is really not about the kind of identifier you use, it is how you handle collisions.
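One common way to handle collisions is to keep track of every identifier already issued and simply draw again when a candidate is taken. The sketch below is only an illustration under that assumption: `PseudonymRegistry` is a hypothetical class, and the in-memory set stands in for whatever the real system would use (typically a database column with a unique constraint).

```python
import secrets


class PseudonymRegistry:
    """Hypothetical registry that issues unique random hex pseudonyms."""

    def __init__(self, digits: int = 8, max_attempts: int = 10) -> None:
        self.digits = digits
        self.max_attempts = max_attempts
        self._issued = set()  # stand-in for a unique index in a database

    def _candidate(self) -> str:
        # token_hex(n) returns 2 * n hex characters; trim to the wanted length.
        return secrets.token_hex((self.digits + 1) // 2)[: self.digits]

    def new(self) -> str:
        """Return a pseudonym that has not been issued before."""
        for _ in range(self.max_attempts):
            candidate = self._candidate()
            if candidate not in self._issued:
                self._issued.add(candidate)
                return candidate
        # Repeated collisions suggest the id space is nearly exhausted.
        raise RuntimeError("could not find a free pseudonym; increase digits")


registry = PseudonymRegistry(digits=8)
print([registry.new() for _ in range(3)])
```

Because a taken candidate is discarded and redrawn, a collision never reaches the user; the number of digits then only determines how often a retry is needed.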

tobib answered Sep 19 '22 23:09