Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

tiny random-looking ID generation

I'm trying to generate unique IDs for use in a Google App Engine application and would like feedback on the feasibility of the approach I'm thinking of using (questions at the end). I've read quite a few questions on this topic, but I don't remember coming across this particular approach.

I'd like random-looking IDs, e.g., MD5 hashes, but I also want them to be small. Four to six characters, along the lines of tinyurl, would be ideal. The IDs will be for user-generated content, in the context of my application, things like test questions that people will be writing. It's not necessary that the IDs be random-looking (it's fine if they look like serial IDs), but the approach I'm thinking of using lends itself to this, so it's not really an issue.

People familiar with Google App Engine will know that writes to the data store are particularly expensive and can result in timeouts if there are too many of them to the same entity group. Sharded counters are an approach that is often used to avoid write contention on a single global counter and the failed transactions that go with it.

Along with getting short IDs and avoiding write contention, I'm trying to avoid the birthday paradox. I would like to prepare for the possibility of there being millions of IDs, even if this is going overboard a little.

I was thinking of using a sharded counter along the following lines:

  • The counter is sharded on users, so that there is a shard for each user. Each counter object has its own count that is specific to a given user, which is incremented when a new item is created by that user. The count is incremented regardless of whether an item is successfully created.
  • The basis of an ID is an MD5 hash of the following string: "<user-email-address>|<latest-counter-value>".
  • The resulting MD5 hash is then truncated, initially to four characters.
  • A single global "length" value is maintained. Whenever the previous steps result in a duplicate key (one imagines this will happen fairly quickly at first), the value of length will be increased by one. MD5 hashes for new IDs will now be truncated at "length" characters, rather than four characters.
  • I don't want to expose the user email address, which suggests that a hash of some kind would be a good way to go.

My questions are: Am I right in thinking that this will largely avoid write contention as a result of duplicate keys and that write contention on the length field would probably not be an issue, especially at longer lengths? Can anyone describe the mathematics involved here? Would the length quickly increase to near the length of an MD5 hash, calling into question the value of the whole approach? Would it be better just to go with the full (longer) MD5 hash in order to keep things easier to maintain? Is there anything that I'm overlooking?

like image 898
Eric Walker Avatar asked May 02 '09 21:05

Eric Walker


2 Answers

How about something like this:

  1. If you want 4 character keys using a-zA-Z0-9, then you'd have: 62^4 = > 14 million possible values.

  2. Break this up into N partitions: 0000 ... 1000, 1001 ... 2000 , ... , ZZAA ... ZZZZ

    Each partition is represented by an entity with: start id end id current id

Now, to generate an ID:

  1. randomly pick a partition. Use the start key for each partition as a key. Start Transaction:
  2. get_or_create() that partition entity.
  3. if current id == end id, go to step 1
  4. save current id
  5. increment current id by one End Transaction

use the current id that was saved as your id.

If you pick N as 140 you each partition would have 100,000 values. This would allow quite a few concurrent inserts while limiting the number of retries due to picking an "empty" partition.

You might need to think about this more. Especially, how will you handle the case when you need to move to 5 or 6 digit keys?

-Dave

like image 51
Dave Avatar answered Sep 20 '22 21:09

Dave


This algorithm is smart and would indeed minimize write operations. So the length value will converge to a balance between shortest length and absence of collisions.

You can relax the constrain of absence of collision by using strategies used in hash tables. Try some other unique IDs before falling back to incrementing the length.

I would thus suggest you add a test counter to the end of the hashed string initialized to 0. If the generated ID is already used, increment the counter and retry until a maximum counter value is reached after what you increase the length.

You will end up with a more efficient use of your ID address space and much less frequent length increments.

Regarding your question about the length limit of MD5 I think the choice of MD5 is an overkill. You don't really need a cryptographic (pseudo)secure hash. What you need is a random bit generator for which you can use crc32 (or adler which is faster). Especially if the code is to be run on a mobile phone. To implement a random bit generator with crc32, prepend a 32bits value to the string to hash and initialize it with a constant value of your choice (the seed). Then compute the crc32 on this string. If you need more bytes, write the obtained crc32 value in the 32bits in front of the string and recompute the crc32. Iterate until you have enough bits. You may replace the crc32 with the algorithm of your choice. This yields a cheap and fast random bit generator where the initial constant is the seed.

With this algorithm you minimize the number of random bits generated and also have no upper limit in length.

Regarding encoding, you didn't specify the constrains. Can you use upper and lower case letters with the digits ? Your example uses an alphabet of 36 different ASCII values. Once you have the pseudo random number generator described above that can generate as many bytes as desired, simply define length to be the number of letters of your ID and pick the next letter with a modulo of the next random byte. You'll know then exactly how many bytes to generate in one go and generating the ID is trivial.

like image 29
chmike Avatar answered Sep 23 '22 21:09

chmike