Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a string hash function that supports h(x) + h(y) = h(x+y)

I'm trying to save space by using hash values of strings. I have a very specific requirement, the simplified description of which is as follows:

I have two sets of string values and a value is provided in runtime. I need to get a list of all strings from the second set that starts with a string from the first set and ends with the query value. Here is a significantly simplified representation and description:

set1:
my_test_val_1
my_test_val_2

set2:
my_test_val_1_extended_to_another_value
my_test_val_2_extended_as_well

My aim is to keep hash values of these sets as in:

set1:
hash(my_test_val_1)
...

set2:
hash(my_test_val_1_extended_to_another_value)

to save on space and when '_extended_to_another_value' arrives as a query, use the hash function with distributive property over addition to do:

hash(my_test_val_1) + hash('_extended_to_another_value') = hash_value_to_search

My search attempts to find a hash function that supports this property has failed most probably due to not using the right keywords for search, so even if you can describe the right terms for what I'm describing above, it would help

like image 978
mahonya Avatar asked Nov 09 '22 17:11

mahonya


1 Answers

Here is one:

import java.util.Random;
public class StringHasher {
    private static int[] CHAR_HASHES = new int[65536];
    static {
        Random rng = new Random();
        for(int k = 0; k < 65536; k++)
            CHAR_HASHES[k] = rng.nextInt();
    }
    public static int hash(String s) {
        int result = 0;
        for(int k = 0; k < s.length(); k++) {
            result += CHAR_HASHES[s.charAt(k)];
        }
        return result;
    }
}

It turns out that any such hash must be constructed by adding up all the hashes of the string's constituent characters - otherwise for example h("hello") = h("h") + h("e") + h("l") + h("l") + h("o") would not hold.

Note: this means you cannot have a very collision-resistant hash, as every string containing the same characters will have the same hash, per the previous paragraph.

Choosing random values for the hash of each single-character string should provide close to the best possible collision resistance, on average. This does waste 256 KiB of memory, and isn't the fastest possible method, and is not repeatable, but it suffices for a proof-of-concept.

like image 146
user253751 Avatar answered Nov 15 '22 12:11

user253751