The requirement:
We have values in DB like
Chennai
Baroda
Bangalore
New Delhi
São Paulo, Lisboa
San Jose
etc...
So I want to convert these string to a unique short string. For example
Chennai –> xy67kr
San Jose –> iuj73d
basically something similar to URL shortner.
And the algorithm to convert this should be reversible.. i.e when I pass "xy67kr" to a decode function it should give me back "Chennai".
Looking forward for help.
As other posters have stated, you cannot have a function that shortens arbitrary strings, that's mathematically impossible. But you can create a custom function that works well with your particular set of strings.
An example approach would be to compute the character frequency in the set, then just code the characters with a prefix code such that the most frequent letters are encoded with short prefixes (i.e. Huffman coding.)
The approach above does not take advantage of the fact that in natural language the next character can be quite accurately predicted from previous ones, so you can extend the above algorithm so that instead of encoding the characters independently, it encodes the next character in an n-gram. This of course requires a larger compression table than the simple approach, since you are effectively having a separate code depending on the prefix. For example if 'e' is very frequent after 'th' then 'e' after 'th' is encoded with a very short prefix. If 'e' is very unfrequent after 'ee', then it can be encoded with a very long prefix in this case. The decoding algorithm obviously needs to look at the currently decompressed prefix to check how to decode the next character.
This general approach assumes that the frequencies do not change, or at least change slowly. If your data set changes than you might have to recompute the statistics and re-code the strings.
See my answer to similar question, and just rewrite it to PHP:
Encoding:
$encoded = base64_encode(gzdeflate("São Paulo, Lisboa"))
Decoding:
$decoded = gzinflate(base64_decode($encoded))
Note that gzdeflate
performs better than gzcompress
on short strings.
But anyway the problem with this is that for short strings it makes the string longer. This performs better on longer texts. It would be of course better to use some compression algorithm with a-priori information, like ppm or suffix method with initial suffix tree... then it would work perfectly on short strings also.
You cannot shorten arbitrary length strings to fixed length one.
What you can do is to create those short strings for the unique ID of the row of that specific string in the database. Here are some tips: How to design a sequential hash-like function.
This isn't necessarily deterministic, but obviously you could use a lookup table. The service would be similar to goo.gl or imgur
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With