Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate same unique hash code for all anagrams

Recently, I attended an interview and faced a good question regarding hash collisions.

Question : Given a list of strings, print out the anagrams together.

Example :

i/p :              {act, god, animal, dog, cat}

o/p :                 act, cat, dog, god


I want to create hashmap and put the word as key and value as list of anagrams

To avoid collision, I want to generate unique hash code for anagrams instead of sorting and using the sorted word as key.

I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will add next word to the value list

Can anyone suggest a good algorithm ?

like image 486
Rajeev Avatar asked Sep 13 '13 07:09

Rajeev


People also ask

Do anagrams have same hash?

The hash function is good: The Fundamental Theorem of Mathematics guarantees that the prime factorization depends only on the frequency of every character in the string, and not their position. As a result, anagrams have identical hash function value.

Is a hash guaranteed to be unique?

You can always create a customized hash that guarantees uniqueness. For data in a known domain (like SSN's), the exercise is relatively simple. If your target hash value actually has more bits available than what you're hashing, the hash simply maps input values to one of the available output values.

Can two values have the same hash?

In computer science, a hash collision or clash is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits.


4 Answers

Hashing with the sorted string is pretty nice, i'd have done that probably, but it could indeed be slow and cumbersome. Here's another thought, not sure if it works - pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to that. Then for a given word, map each character into the matching prime, and multiply. finally, hash using the result.

This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).

simple e.g. -

int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code

inline int prime_map(char c) {
    // check c is in legal char set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word); 

[edit]

A few words about the uniqueness - any integer number has a single breakdown to multiplications of primes, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only these words. Just break into primes, p1^n1*p2^n2*... and convert each prime to the matching char. the char for p1 would appear n1 times, and so on. You can't get any new prime you didn't explicitly used, being prime means you can't get it by any multiplication of other primes.

This brings another possible improvement - if you can construct the string, you just need to mark the permutations you saw when populating the hash. since the permutations can be ordered by lexicographic order, you can replace each one with a number. This would save the space of storing the actual strings in the hash, but would require more computations so it's not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)

like image 99
Leeor Avatar answered Oct 12 '22 13:10

Leeor


Hash function : Assign primary numbers to each character. While calculating hash code, get the prime number assigned to that character and multiply with to existing value.Now all anagrams produce same hash value.

ex : a - 2, c - 3 t - 7

hash value of cat = 3*2*7 = 42 hash value of act = 2*3*7 = 42 Print all strings which are having same hash value(anagrams will have same hash value)

like image 35
4 revs Avatar answered Oct 12 '22 12:10

4 revs


The other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that won't overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On MAC OS X, this file is located here: /usr/share/dict/words.)

My word_hash function takes the ordinal value of each character mod 32. This will make sure that uppercase and lowercase letters have the same code. The large prime I use is 2^58 - 27. Any large prime will do so long as it is less than 2^64 / A where A is my alphabet size. I am using 32 as my alphabet size, so this means I can't use a number larger than about 2^59 - 1. Since ruby uses one bit for sign and a second bit to indicate if the value is a number or an object, I lose a bit over other languages.

def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
  # Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end

words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."
like image 3
Paul Chernoch Avatar answered Oct 12 '22 11:10

Paul Chernoch


Small practical Optimization , I would suggest for the above hash method is :

Assign least prime number to vowels and then most frequently occurring consonants. Ex : e : 2 a : 3 i : 5 o : 7 u : 11 t : 13 and so on...

Also, average word length for english is : ~ 6

Also, top 26 prime numbers are less than 100 [2,3,5,7, .. , 97]

Hence, on average your hash would generate value around 100^6 = 10^12.

So there are very less chances of collision if you take prime number for modulo bigger than 10^12.

like image 3
Utsav Chokshi Avatar answered Oct 12 '22 13:10

Utsav Chokshi