Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating a perfect hash function given known list of strings?

Tags:

algorithm

Suppose I have a list of N strings, known at compile-time.

I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.

For example, suppose my strings are:

 {"apple", "orange", "banana"}

Such a function may return:

f("apple") -> 2
f("orange") -> 1
f("banana") -> 3

What's a strategy to generate this function?

I was thinking to analyze the strings at compile time and look for a couple of constants I could mod or add by or something?

The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).

like image 841
Andrew Tomazos Avatar asked Jan 25 '26 10:01

Andrew Tomazos


1 Answers

Say you have m distinct strings, and let ai, j be the jth character of the ith string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating ai, j as the null character if j ≥ |ai|.

The idea I suggest is composed of two parts:

  1. Find (at most) m - 1 positions differentiating the strings, and store these positions.

  2. Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.


Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.

Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts, and applying the argument recursively to each part with more than 2 rows, shows this.

Say the m - 1 positions are p1, ..., pm - 1. In the following, recall the meaning above for ai, pj for pj ≥ |ai|: it is the null character.


let us define h(ai) = ∑j = 1m - 1[qj ai, pj % n], for random qj and some n. Then h is known to be a universal hash function: the probability of pair-collision P(x ≠ y ∧ h(x) = h(y)) ≤ 1/n.


Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m2 and successively trying the above h with n = m2 with randomized coefficients, until there are no collisions. The number of attempts needed until this is achieved, is expected 2 and the probability that more attempts are needed, decreases exponentially.

like image 187
Ami Tavory Avatar answered Jan 28 '26 05:01

Ami Tavory



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!