Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A good hash function to use in interviews for integer numbers, strings?


I have come across situations in an interview where I needed to use a hash function for integer numbers or for strings. In such situations which ones should we choose ? I've been wrong in these situations because I end up choosing the ones which have generate lot of collisions but then hash functions tend to be mathematical that you cannot recollect them in an interview. Are there any general recommendations so atleast the interviewer is satisfied with your approach for integer numbers or string inputs? Which functions would be adequate for both inputs in an "interview situation"

like image 210
phoenix Avatar asked May 21 '11 16:05

phoenix


People also ask

What is a good hash function for integers?

A good hash function to use with integer key values is the mid-square method. The mid-square method squares the key value, and then takes out the middle r bits of the result, giving a value in the range 0 to 2r−1. This works well because most or all bits of the key value contribute to the result.

How do you choose a good hash function?

A good hash function should have the following properties: Efficiently computable. Should uniformly distribute the keys (Each table position equally likely for each key)

Which string hashing is best?

If you just want to have a good hash function, and cannot wait, djb2 is one of the best string hash functions i know. it has excellent distribution and speed on many different sets of keys and table sizes. you are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc.

What are 2 common hash functions?

The most common hash functions used in digital forensics are Message Digest 5 (MD5), and Secure Hashing Algorithm (SHA) 1 and 2.


3 Answers

Here is a simple recipe from Effective java page 33:

  1. Store some constant nonzero value, say, 17, in an int variable called result.
  2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
    1. Compute an int hash code c for the field:
      • If the field is a boolean, compute (f ? 1 : 0).
      • If the field is a byte, char, short, or int, compute (int) f.
      • If the field is a long, compute (int) (f ^ (f >>> 32)).
      • If the field is a float, compute Float.floatToIntBits(f).
      • If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.1.iii.
      • If the field is an object reference and this class’s equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a “canonical representation” for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional). 48 CHAPTER 3 METHODS COMMON TO ALL OBJECTS
      • If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
    2. Combine the hash code c computed in step 2.1 into result as follows: result = 31 * result + c;
  3. Return result.
  4. When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.
like image 151
Premraj Avatar answered Oct 17 '22 17:10

Premraj


You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.

  • If it's for use in hashed data structures like hashmaps, you want it to be a simple as possible (fast to execute) and avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer - this is the standard hashCode() implementation in java.lang.Integer

  • If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.

  • If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:

public static final int hash(int a) {         
      a ^= (a << 13);
      a ^= (a >>> 17);        
      a ^= (a << 5);
      return a;   
}

If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:

public static <T> int hashCode(T[] data) {
    int result=0;
    for(int i=0; i<data.length; i++) {
        result^=data[i].hashCode();
        result=Integer.rotateRight(result, 1);
    }
    return result;
}

Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions but that's unavoidable when hashing a large structure to a integer :-)

like image 30
mikera Avatar answered Oct 17 '22 16:10

mikera


For integers, I usually go with k % p where p = size of the hash table and is a prime number and for strings I choose hashcode from String class. Is this sufficient enough for an interview with a major tech company? – phoenix 2 days ago

Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..

Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about 50/50 chance of flipping. I found that quite counter-intuitive because a lot of the hashing functions I first saw used bit-shifts and XOR and a flipped input bit usually flipped one output bit (usually in another bit position, so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.

One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (could be faster too depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great sizeof int hash value - indeed "perfect" if inputs are randomly scattered through the int values (rather than say incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shift the values etc..

All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....

like image 2
Tony Delroy Avatar answered Oct 17 '22 16:10

Tony Delroy