Generate same unique hash code for all anagrams

Tags:

Recently, I attended an interview and faced a good question regarding hash collisions.

Question : Given a list of strings, print out the anagrams together.

Example :

i/p : {act, god, animal, dog, cat}

o/p : act, cat, dog, god

I want to create hashmap and put the word as key and value as list of anagrams

To avoid collision, I want to generate unique hash code for anagrams instead of sorting and using the sorted word as key.

I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will add next word to the value list

Can anyone suggest a good algorithm ?

486

asked Sep 13 '13 07:09

Rajeev

4 Answers

Hashing with the sorted string is pretty nice, i'd have done that probably, but it could indeed be slow and cumbersome. Here's another thought, not sure if it works - pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to that. Then for a given word, map each character into the matching prime, and multiply. finally, hash using the result.

This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).

simple e.g. -

int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code

inline int prime_map(char c) {
    // check c is in legal char set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word);

[edit]

A few words about the uniqueness - any integer number has a single breakdown to multiplications of primes, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only these words. Just break into primes, p1^n1*p2^n2*... and convert each prime to the matching char. the char for p1 would appear n1 times, and so on. You can't get any new prime you didn't explicitly used, being prime means you can't get it by any multiplication of other primes.

This brings another possible improvement - if you can construct the string, you just need to mark the permutations you saw when populating the hash. since the permutations can be ordered by lexicographic order, you can replace each one with a number. This would save the space of storing the actual strings in the hash, but would require more computations so it's not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)

answered Oct 12 '22 13:10

Leeor

Hash function : Assign primary numbers to each character. While calculating hash code, get the prime number assigned to that character and multiply with to existing value.Now all anagrams produce same hash value.

ex : a - 2, c - 3 t - 7

hash value of cat = 3*2*7 = 42 hash value of act = 2*3*7 = 42 Print all strings which are having same hash value(anagrams will have same hash value)

answered Oct 12 '22 12:10

4 revs

The other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that won't overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On MAC OS X, this file is located here: /usr/share/dict/words.)

My word_hash function takes the ordinal value of each character mod 32. This will make sure that uppercase and lowercase letters have the same code. The large prime I use is 2^58 - 27. Any large prime will do so long as it is less than 2^64 / A where A is my alphabet size. I am using 32 as my alphabet size, so this means I can't use a number larger than about 2^59 - 1. Since ruby uses one bit for sign and a second bit to indicate if the value is a number or an object, I lose a bit over other languages.

def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
  # Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end

words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."

answered Oct 12 '22 11:10

Paul Chernoch

Small practical Optimization , I would suggest for the above hash method is :

Assign least prime number to vowels and then most frequently occurring consonants. Ex : e : 2 a : 3 i : 5 o : 7 u : 11 t : 13 and so on...

Also, average word length for english is : ~ 6

Also, top 26 prime numbers are less than 100 [2,3,5,7, .. , 97]

Hence, on average your hash would generate value around 100^6 = 10^12.

So there are very less chances of collision if you take prime number for modulo bigger than 10^12.

answered Oct 12 '22 13:10

Utsav Chokshi

Related questions
                            
                                The quickest escape from recursion in Java [duplicate]
                            
                                Non biased return a list of n random positive numbers (>=0) so that their sum == total_sum
                            
                                Number of 1s in the two's complement binary representations of integers in a range
                            
                                How to know the repeating decimal in a fraction?
                            
                                Check if the digits in the number are in increasing sequence in python
                            
                                Is there any algorithm for converting 2D video into 3D video?
                            
                                Text classification/categorization algorithm [closed]
                            
                                Achieve hierarchy, Parent/Child Relationship in an effective and easy way
                            
                                Find an element in an array recursively
                            
                                Image resizing algorithm
                            
                                Simple algorithm for drawing filled ellipse in C/C++
                            
                                Understanding word alignment
                            
                                How exactly do you compute the Fast Fourier Transform?
                            
                                Two Egg problem confusion
                            
                                How to get Number Of Lines without Reading File To End
                            
                                Fast Inverse Square Root on x64
                            
                                Can min/max of moving window achieve in O(N)?
                            
                                How to "zip" or "rotate" a variable number of lists?
                            
                                Is Arrays.stream(array_name).sum() slower than iterative approach?
                            
                                find an element in a sorted matrix [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Generate same unique hash code for all anagrams

Tags:

algorithm

hashmap

hashtable

hash