Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for generating a unique (constant) code for a string which should be reversible

The requirement:

We have values in DB like

Chennai
Baroda
Bangalore
New Delhi
São Paulo, Lisboa
San Jose

etc...

So I want to convert these string to a unique short string. For example

Chennai –> xy67kr

San Jose –> iuj73d

basically something similar to URL shortner.

And the algorithm to convert this should be reversible.. i.e when I pass "xy67kr" to a decode function it should give me back "Chennai".

Looking forward for help.

like image 895
Taher Avatar asked Mar 30 '12 08:03

Taher


4 Answers

As other posters have stated, you cannot have a function that shortens arbitrary strings, that's mathematically impossible. But you can create a custom function that works well with your particular set of strings.

An example approach would be to compute the character frequency in the set, then just code the characters with a prefix code such that the most frequent letters are encoded with short prefixes (i.e. Huffman coding.)

The approach above does not take advantage of the fact that in natural language the next character can be quite accurately predicted from previous ones, so you can extend the above algorithm so that instead of encoding the characters independently, it encodes the next character in an n-gram. This of course requires a larger compression table than the simple approach, since you are effectively having a separate code depending on the prefix. For example if 'e' is very frequent after 'th' then 'e' after 'th' is encoded with a very short prefix. If 'e' is very unfrequent after 'ee', then it can be encoded with a very long prefix in this case. The decoding algorithm obviously needs to look at the currently decompressed prefix to check how to decode the next character.

This general approach assumes that the frequencies do not change, or at least change slowly. If your data set changes than you might have to recompute the statistics and re-code the strings.

like image 60
Rafał Dowgird Avatar answered Nov 12 '22 03:11

Rafał Dowgird


See my answer to similar question, and just rewrite it to PHP:

Encoding:

$encoded = base64_encode(gzdeflate("São Paulo, Lisboa"))

Decoding:

$decoded = gzinflate(base64_decode($encoded))

Note that gzdeflate performs better than gzcompress on short strings.

But anyway the problem with this is that for short strings it makes the string longer. This performs better on longer texts. It would be of course better to use some compression algorithm with a-priori information, like ppm or suffix method with initial suffix tree... then it would work perfectly on short strings also.

like image 4
Tomas Avatar answered Nov 12 '22 03:11

Tomas


You cannot shorten arbitrary length strings to fixed length one.

What you can do is to create those short strings for the unique ID of the row of that specific string in the database. Here are some tips: How to design a sequential hash-like function.

like image 3
Karoly Horvath Avatar answered Nov 12 '22 03:11

Karoly Horvath


This isn't necessarily deterministic, but obviously you could use a lookup table. The service would be similar to goo.gl or imgur

like image 1
Kyle Kinkade Avatar answered Nov 12 '22 01:11

Kyle Kinkade