Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Encoding name strings into an unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.

What is a smart way of encoding this? I know it is easy to tag each alphabet of the name according to its position in the alphabet set (a-> 1, b->2.. and so on) and so a name like Deepa would get -> 455161, but again here I cannot make out if the '16' is really 16 or a combination of 1 and 6.

So, I am looking for a smart way of encoding the names.

Furthermore, the encoding should be such that the number of digits in the output numeral for any name should have fixed number of digits, i.e., it should be independent of the length. Is this possible?

Thanks Abhishek S

like image 303
London guy Avatar asked Apr 26 '12 17:04

London guy


1 Answers

To get the same width numbers, can't you just zero-pad on the left?

Some options:

  1. Sort them. Count them. The 10th name is number 10.
  2. Treat each character as a digit in a base 26 (case insensitive, no digits) or 52 (case significant, no digits) or 36 (case insensitive with digits) or 62 (case sensitive with digits) number. Compute the value in an int. EG, for a name of "abc", you'd have 0 * 26^2 + 1 * 26^1 + 2 * 20^0. Sometimes Chinese names may use digits to indicate tonality.
  3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
  4. This one's mostly suggested in fun: use goedel numbering :). So "abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes. Factoring the number gives you back the characters. The numbers could get quite large though.
  5. Convert to ASCII, if you aren't already using it. Then treat each ordinal of a character as a digit in a base-256 numbering system. So "abc" is 0*256^2 + 1*256^1 + 2*256^0.

If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.

I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.

like image 59
user1277476 Avatar answered Sep 24 '22 13:09

user1277476