I have many integers in range [0; 2^63-1]. There is only 10^8 integers, however. There is no duplicates. Full list is known at compile-time but it is just unique random numbers. These numbers never changes.
To store one integer explicitly, 8 bytes required, and there is associated 1-byte values, so explicit storing requires about 860 MB.
So I want to find minimal perfect hash function to map each of 10^8 integers from [0;2^63-1] to [0;10^8-1]. I should find this function only once, data never changes, and function can be complicated. But it should be minimal, perfect, and calculating should be fast. How I can do this better? Maybe it is possible to find and use some subsequences if they happens?
Thanks.
A perfect hash function can be constructed that maps each of the keys to a distinct integer, with no collisions. These functions only work with the specific set of keys for which they were constructed. Passing an unknown key will result a false match or even crash. A minimal perfect hash function goes one step further.
Perfect hashing is defined as a model of hashing in which any set of n elements can be stored in a hash table of equal size and can have lookups performed in constant time. It was specifically invented and discussed by Fredman, Komlos and Szemeredi (1984) and has therefore been nicknamed as "FKS Hashing".
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. In mathematical terms, it is an injective function. Perfect hash functions may be used to implement a lookup table with constant worst-case access time.
Probably the one most commonly used is SHA-256, which the National Institute of Standards and Technology (NIST) recommends using instead of MD5 or SHA-1. The SHA-256 algorithm returns hash value of 256-bits, or 64 hexadecimal digits.
Let your computer do the work for you:
http://www.gnu.org/software/gperf/
Quote: "GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only. "
I'm working on an algorithm and Java implementation that needs less than 1.6 bits per key.
Previously, I have implemented a minimal perfect hash function tool in Java that needs less than 2.0 bits per key.
Other algorithms are implemented in CMPH. For example CHD needs about 2.06 bits per key by default. It can be configured to use less space, but generation is then slower.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With