Generating a perfect hash function given known list of strings?

Question

Suppose I have a list of N strings, known at compile-time.

I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.

For example, suppose my strings are:

 {"apple", "orange", "banana"}

Such a function may return:

f("apple") -> 2
f("orange") -> 1
f("banana") -> 3

What's a strategy to generate this function?

I was thinking to analyze the strings at compile time and look for a couple of constants I could mod or add by or something?

The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).

Ami Tavory · Accepted Answer

Say you have m distinct strings, and let a_{i, j} be the jth character of the ith string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating a_{i, j} as the null character if j ≥ |a_i|.

The idea I suggest is composed of two parts:

Find (at most) m - 1 positions differentiating the strings, and store these positions.
Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.

Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.

Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts, and applying the argument recursively to each part with more than 2 rows, shows this.

Say the m - 1 positions are p₁, ..., p_{m - 1}. In the following, recall the meaning above for a_{i, p_j} for p_j ≥ |a_i|: it is the null character.

let us define h(a_i) = ∑_{j = 1}^{m - 1}[q_j a_{i, p_j} % n], for random q_j and some n. Then h is known to be a universal hash function: the probability of pair-collision P(x ≠ y ∧ h(x) = h(y)) ≤ 1/n.

Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m² and successively trying the above h with n = m² with randomized coefficients, until there are no collisions. The number of attempts needed until this is achieved, is expected 2 and the probability that more attempts are needed, decreases exponentially.

Generating a perfect hash function given known list of strings?

Tags:

algorithm

Andrew Tomazos

1 Answers

Ami Tavory

Recent Activity

Donate For Us

Generating a perfect hash function given known list of strings?

Tags:

algorithm

Andrew Tomazos

1 Answers

Ami Tavory

Related questions

Recent Activity

Donate For Us