Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does this hash function work? Are these numbers random?

I'm currently reading K&R's "The C Programming Language" book. In "Structures" chapter, under the subtopic of "Table Lookup" (Page 144) I found this hash generating function

#define HASHSIZE 101

struct nlist {
    struct nlist *next;
    char *name;
    char *defn;
}

static struct nlist *hashtab[HASHSIZE];

unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

What I don't understand is what this function actually does.

I think it generates an unique address (as an index on hashtab) for the given string (char *s).

But I think two different strings can be given the same index since (hashval % HASHSIZE) is the given address (203 % 101 = 405 % 101 = 1).

And why is HASHSIZE 101 and hashval multiplied by 31? Why not 100 or 32?

like image 687
0xEDD1E Avatar asked Nov 08 '15 08:11

0xEDD1E


2 Answers

What I don't understand is what this function actually do?

It basically hashes the string pointed to by the char *s pointer, until it encounters the end of the string, which is marked by the null character '\0'. In other words, it calculates (or maps) a given input string to an integer value.

You can also see that it does this by going through each character in the string (i.e. the s++), making the time complexity of this function linearly dependent on string length --or O(N)-- and that it avoids generating a value that goes beyond the bounds of the array with the final modulus operation.

I think it generates an unique address (as an index on hashtab) for the given string(char *s).

It takes the input value (i.e. the string being hashed) and uses it to find out the index within the array in which the string should be placed. So it's not technically generating an address because the function does not return a pointer. The word offset would be more accurate here.

But I think two different strings can be given the same index since (hashval % HASHSIZE) is the given address (203 % 101 = 405 % 101 = 1).

True. This is called a collision. Writing hash functions that are good at avoiding collisions is not easy. In most discussions, you'll see methods for collision resolution in order to handle these cases.

For example, one method could be to turn each array element into a pointer to a linked list where the elements that have collided (i.e. hashed the same index value) are appended to. There're other methods, but that's a different discussion.

Ideally, perfect hash functions would be used because they're guaranteed to never generate the same hash value for two different inputs, making collision resolution unnecessary.

There're book chapters written on these topics, mainly when it comes to searching, so you might want to give those a read.

And Why HASHSIZE is 101 and hashval is multiplied by 31 (why not 100 or 32)?

Because 101 and 31 are prime numbers and therefore less likely to end up generating collisions by multiplying/dividing themselves into the same bucket as a previous, and different, string.

like image 69
code_dredd Avatar answered Nov 15 '22 23:11

code_dredd


Hash functions in general might generate the same hash value for different strings. Thats why a collision resolution is needed.

About the value for HASHSIZE and hashval: I am not a expert in hash functions, but in the few ones I read about, the numbers used were empirically obtained. You can read the answer to this other topic, this might help you.

like image 6
Rodrigo Vianna Avatar answered Nov 15 '22 22:11

Rodrigo Vianna