I am looking for a way to create an int\long representation of an arbitrary alpha-numeric String. Hash codes won't do it, because I can't afford hash collisions i.e. the representation must be unique and repeatable.
The numeric representation will be used to perform efficient (hopefully) compares. The creation of the numeric key will take some time, but it only has to happen once, whereas I need to perform vast numbers of comparisons with it - which will hopefully be much faster than comparing the raw Strings.
Any other idea's on faster String comparison will be most appreciated too...
Unless your string is limited in length, you can't avoid collisions.
There are 4294967296 possible values for an integer (2^32). If you have a string of more than 4 ASCII characters, or more than two unicode characters, then there are more possible string values than possible integer values. You can't have a unique integer value for every possible 5 character string. Long values have more possible values, but they would only provide a unique value for every possible string of 8 ASCII characters.
Hash codes are useful as a two step process: first see if the hash code matches, then check the whole string. For most strings that don't match, you only need to do the first step, and it's really fast.
Can't you just start with a hash code, and if the hash codes match, do a character by character comparison?
How long are the strings? If they are very short, then a unique ID can be generated by considering the characters as digits in base 36 (26 + 10) that form a n-digits number where n is the length of the string. On the other hand, if the strings are short enough to allow this, direct comparison won't be an issue anyway.
Otherwise you'll have to generate a collision-free hash and this can only be done when the complete problem space is known in advance (i.e. if you know all strings that can possibly occur). You will want to have a look at perfect hashing, although the only feasible algorithm to find a perfect hash function that I know is probabilistic so collisions are still theoretically possible.
There might be other ways to find such a function. Knuth called this a “rather amusing … puzzle” in TAoCP but he doesn't give an algorithm either.
In general, you give way too few information to find an algorithm that doesn't require probing the whole problem space in some manner. This does invariably mean that the problem has exponential running time but could be solved using machine-learning heuristics. I'm not sure if this is advisable in your case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With