Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate unique id from unique string input

I have a table with a column of unique string values. The max length of the string value is 255 char. I want to generate a unique id with the string value as input. In other words I am looking for a compact representation for a string. The unique id generated can be alpha-numeric. A useful feature to have would be to be able to regenerate the string value from the unique id.

Is there an efficient function to generate such an unique id. Some ways could be using checksum or hash functions. I want to know if there is a standard way to do this.

I am using MySql database and java.

Thanks!

--edit: I am looking for a more compact representation rather than just using the string itself.

like image 909
pkrish Avatar asked Jun 08 '26 18:06

pkrish


2 Answers

How unique is "unique"? Using any good hashing function (MD5 is decent for most uses, and easily implemented via java.security.MessageDigest.getInstance("MD5") can get you to a 128-bit number that's very very likely to be unique. Using a subset of the hash gets you a smaller ID, with a higher chance of collision.

Using an auto_increment field in the DB, if it fits your design, might be easier to implement, will truly guarantee uniqueness, and will use smaller IDs than the 16 bytes of MD5. You can also then meet your requirement of finding the string by the key, which you can't do for a hash.

like image 90
Dagon Avatar answered Jun 11 '26 09:06

Dagon


If you want to be able to go "Backwards" (The indexes are guaranteed unique and can be reversed to the original strings) what you are doing is compression, anything that isn't compression (checksum) is not reversable.

To do the compression, the simplest way would be to bit-pack and get each character down to the bare minimum number of bits.

A-Z is 26 chars which is less than 32 (5 bits)

add a-z and it's 6 bits (with somewhere around 12 bit-patterns left over to represent other characters).

Let's say that is enough for you. So you have 6x255 bits which is 1530 bits to store your string. (191 bytes)

Going with only caps would reduce that a little (to 159 bytes)

You can optimize it more, but then you have to go into a compression algorithm that expects a specific language or patterns in the Strings and optimizes those patterns.

Unless you can further specify the contents of the strings, you're just not going to get what you want. Sorry. (If you can tell more about the contents of the strings, do so. One of us may see patterns that will allow much better "Compression")

This lack of ability to do what you want is why hashtables are so cool. They get a "Mostly Unique" number and then have a second level of resolution to test cases where two strings hashed to the same number.

like image 32
Bill K Avatar answered Jun 11 '26 11:06

Bill K