Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I encode a string to an integer in a way that preserves lexicographic string closeness?

I would like to encode strings of varying length (typically 1-100 characters) to integers in a way that strings which are lexicographically similar (they would be close together in a dictionary) result in integers that are close together, while further ensuring that these integers are reasonably evenly distributed across the range of possible integer values.

I recognize that ensuring an even distribution may require some kind of survey of the possible strings prior to encoding them.

Does anyone have any ideas on how to do this?

like image 797
sanity Avatar asked Nov 25 '11 18:11

sanity


3 Answers

A general approach would be using the first n characters in your string, zero-byte-padded as necessary, as an integer number. Reduce your alphabet accordingly, and you should achieve a fairly dense packing. Example: Assume your input alphabet is Base64 with / representing the end of string. You'd hash the string 'word/' by setting the six highest-most bits of your integer to 48, the next six to 40, and so on. Pad with two zeros, and you have got an exact representation in a 32 bit integer.

Lexicographically close words will have similar beginnings and thereby similar most-significant bits.

Naturally, words longer than 5 characters have hash collisions, but that can't be avoided.

like image 61
thiton Avatar answered Oct 15 '22 02:10

thiton


Your requirements are pretty tight. How about using a minimal perfect hash function? This ensures that if you give the strings in lexicographic order:

s1 < s2 < s3 < s4 < ... < sN

they will be mapped to consecutive integers in the range [0..N-1]. See these papers:

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2010/01_appoggiomg-minordhash.pdf

http://vigna.dsi.unimi.it/ftp/papers/MonotoneMinimalPerfectHashing.pdf

like image 40
Tudor Avatar answered Oct 15 '22 01:10

Tudor


This is not possible. Assume you came up with some function to map strings to integers. Then let's say you mapped the first input string, s1, to an integer, i1, and mapped the second input string, s2, to i2. The problem then lies with the input strings that follow. You only have room for |i2 - i1| more input strings that fall between s1 and s2. But there's no way to guarantee that you won't receive more than |i2 - i1| strings that fall between s1 and s2, at least not practically (you'd have to use integers on the order of 26^100 for strings of a single case with up to 100 characters).

like image 20
mbeckish Avatar answered Oct 15 '22 03:10

mbeckish