
Why are 5381 and 33 so important in the djb2 algorithm?

Tags: hash

The djb2 algorithm has a hash function for strings.

unsigned long hash = 5381;
int c;

while (c = *str++)
    hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
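For reference, the fragment above can be wrapped into a complete function. This is only a sketch: the function name djb2 and the cast through unsigned char (so that bytes above 127 are not sign-extended) are assumptions, not part of the snippet as posted.

```c
/* djb2 string hash: start at 5381, then hash = hash * 33 + c for each byte.
   The function name and the unsigned char cast are illustrative assumptions. */
unsigned long djb2(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    unsigned long hash = 5381;
    int c;

    while ((c = *p++) != 0)
        hash = ((hash << 5) + hash) + (unsigned long)c; /* hash * 33 + c */

    return hash;
}
```

For example, djb2("") returns 5381, and djb2("a") returns 5381 * 33 + 97 = 177670.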

Why are 5381 and 33 so important?

Vishnu Pedireddi asked Oct 16 '09 18:10




2 Answers

This hash function is similar to a Linear Congruential Generator (LCG, a simple class of functions that generate a series of pseudo-random numbers), which generally has the form:

X = (a * X) + c;  // "mod M", where M = 2^32 or 2^64 typically 

Note the similarity to the djb2 hash function: a = 33 and M = 2^32 (or 2^64 where unsigned long is 64 bits wide). In order for an LCG to have a "full period" (i.e. to step through all M possible values before repeating), a must have certain properties:

  • a-1 is divisible by all prime factors of M (a-1 is 32, which is divisible by 2, the only prime factor of 2^32)
  • a-1 is a multiple of 4 if M is a multiple of 4 (yes and yes)

In addition, c and M are supposed to be relatively prime (which will be true for odd values of c).
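To check these conditions on a small scale, here is a sketch that counts how many steps an LCG takes to return to its seed. The helper name lcg_cycle_length and the small modulus 2^16 (chosen so the loop finishes instantly) are assumptions for illustration. djb2 itself feeds in a different c at every step, so this only illustrates the LCG side of the analogy.

```c
#include <stdint.h>

/* Counts the steps until x -> (a * x + c) mod m returns to the seed 0.
   Returns 0 if the seed is not reached again within m steps.
   Hypothetical helper, for illustration only. */
uint32_t lcg_cycle_length(uint32_t a, uint32_t c, uint32_t m)
{
    uint32_t x = 0;

    for (uint32_t steps = 1; steps <= m; steps++) {
        /* 64-bit intermediate so a * x cannot overflow before the modulo */
        x = (uint32_t)(((uint64_t)a * x + c) % m);
        if (x == 0)
            return steps;
    }
    return 0;
}
```

With a = 33, c = 1 and m = 65536 the cycle length is 65536, a full period. With a = 31 (where a-1 = 30 is not a multiple of 4) the cycle is shorter.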

So as you can see, this hash function somewhat resembles a good LCG. And when it comes to hash functions, you want one that produces a "random" distribution of hash values given a realistic set of input strings.

As for why this hash function is good for strings, I think it strikes a good balance between being extremely fast and providing a reasonable distribution of hash values. I've seen many other hash functions that claim much better output characteristics but involve many more lines of code. For instance, see this page about hash functions.

EDIT: This good answer explains why 33 and 5381 were chosen for practical reasons.

3 revs answered Dec 12 '22 08:12


33 was chosen because:

1) As stated before, multiplication by 33 is easy to compute using a shift and an add.

2) As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits; using a smaller shift would keep bit interactions more local, so the interactions would take longer to spread.

3) The shift of 5 is relatively prime to 32 (the number of bits in the register), which helps with avalanching. While there are enough characters left in the string, each bit of an input byte will eventually interact with every preceding bit of input.

4) The shift of 5 is a good shift amount when considering ASCII character data. An ASCII character can sort of be thought of as a 4-bit character type selector and a 4-bit character-of-type selector. E.g. the digits all have 0x3 in the upper 4 bits. So an 8-bit shift would cause bits with a certain meaning to mostly interact with other bits that have the same meaning. A 4-bit or 2-bit shift would similarly produce strong interactions between like-minded bits. The 5-bit shift causes many of the four low-order bits of a character to strongly interact with many of the four upper bits in the same character.
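Points 1 and 4 can be checked with a few lines of C. This is a sketch, not anything from the original djb2 source: the helper names are made up for illustration, and the nibble examples assume an ASCII execution character set.

```c
#include <stdint.h>

/* Multiply by 33 using one shift and one add: (h << 5) + h == h * 32 + h. */
uint32_t times33_shift_add(uint32_t h)
{
    return (h << 5) + h;
}

/* Upper four bits of a byte (the "character type selector" above). */
unsigned high_nibble(unsigned char ch)
{
    return ch >> 4;
}

/* Lower four bits of a byte (the "character-of-type selector" above). */
unsigned low_nibble(unsigned char ch)
{
    return ch & 0x0Fu;
}
```

times33_shift_add(h) equals h * 33u for every 32-bit h, including values that wrap around. high_nibble('0') through high_nibble('9') are all 0x3, while low_nibble gives the digit's value.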

As stated elsewhere, the choice of 5381 isn't too important and many other choices should work as well here.

This is not a fast hash function, since it processes its input a character at a time and doesn't try to use instruction-level parallelism. It is, however, easy to write. Quality of the output divided by ease of writing the code is likely to hit a sweet spot.

On modern processors, multiplication is much faster than it was when this algorithm was developed and other multiplication factors (e.g. 2^13 + 2^5 + 1) may have similar performance, slightly better output, and be slightly easier to write.
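A sketch of the same loop with a plain multiply and the factor passed in as a parameter. mul_hash is a hypothetical name, and the 32-bit width and the 5381 seed are assumptions carried over from djb2. I have not measured whether 2^13 + 2^5 + 1 = 8225 actually gives better output.

```c
#include <stdint.h>

/* Same loop as djb2, but with the multiplier passed in.
   Hypothetical helper; arithmetic wraps modulo 2^32. */
uint32_t mul_hash(const char *str, uint32_t multiplier)
{
    const unsigned char *p = (const unsigned char *)str;
    uint32_t hash = 5381;

    while (*p != 0)
        hash = hash * multiplier + *p++;

    return hash;
}
```

mul_hash(s, 33u) reproduces the 32-bit djb2 value, and mul_hash(s, 8225u) uses the alternative factor.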

Contrary to an answer above, a good non-cryptographic hash function doesn't want to produce a random output. Instead, given two inputs that are nearly identical, it wants to produce widely different outputs. If your input values are randomly distributed, you don't need a good hash function; you can just use an arbitrary set of bits from your input. Some of the modern hash functions (Jenkins 3, Murmur, probably CityHash) produce a better distribution of outputs than random given inputs that are highly similar.
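One rough way to look at "widely different outputs" is to count how many bits differ between the hashes of two similar strings. The sketch below is only an illustration: djb2_32 and bit_difference are made-up names, and the 32-bit width is an assumption.

```c
#include <stdint.h>

/* 32-bit djb2, repeated here so the example stands alone. */
uint32_t djb2_32(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    uint32_t hash = 5381;

    while (*p != 0)
        hash = hash * 33u + *p++;

    return hash;
}

/* Number of bit positions in which two hash values differ. */
unsigned bit_difference(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned count = 0;

    while (x != 0) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}
```

For a well-mixed 32-bit hash you would hope for roughly 16 differing bits. Note that djb2_32("a") and djb2_32("b") differ in only one bit (177670 versus 177671), because the last character is added after the final multiply. The mixing described above only happens while more characters follow.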

Chuck Simmons answered Dec 12 '22 07:12