Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Measuring a hash functions quality (for use with maps/assosiative arrays)

I'm looking into an associative array library in C (which I didn't write). Similar to a map in C++ or Python's dict's.

There are some non-standard hashing functions which I'm not certain if they are even very good. (maybe the original developer just threw some magic-numbers, xor-operators and hoped for the best).

I wrote a test which measures how well the hashing function performs given some sample input, to measure how evenly it distributes items into a fixed number of buckets (modulo array size in this case).

This way, given enough input, there would be some way to measure how well a hash function performs.

It seems like it must be a common issue for anyone writing an associative array.


Is there some convention for measuring how well a hash function performs? (in terms of distribution quality, not speed).

Where worst would be the same result for every input, and best would give an even distribution (or as close as possible).

Note, I'm not looking for cryptographic strength here.

like image 638
ideasman42 Avatar asked Dec 09 '25 21:12

ideasman42


1 Answers

There is a Formula (in the middle of the page) from the dragon book.

I personally have a rule of thumb: (assuming linear chaining) Insert N items into N slots->chains, and compute the total number of accesses (first in chain := 1 access; 2nd := 2 accesses, etc) needed to obtain all N elements. (this is equal SUM ( chainlen * (chainlen +1) /2) , summed over all chains)

Given random input data, for any reasonable hashfunction this metric should be 1.5 * N, or a bit below that.


Example of a typical run using a list of 2543846 unique tokens/words (and their statistics) hashed into exactly 2543846 slots/buckets:

plasser@pisbak:~/src/hash$ ./diskhash woorden.txt woorden.hsh
Ptr = 0x7fb5c264f000, Sz = 37362821
Array= 0x7fb5bff7e000 Cnt = 2543846
__________________
Histogram of seek lenghts:
len:    Count     Hops   Fraction (Cumulative)
  1:  1606429  1606429 0.63149617 (0.63149617)
  2:   672469  1344938 0.26435130 (0.89584747)
  3:   205046   615138 0.08060472 (0.97645219)
  4:    48604   194416 0.01910650 (0.99555869)
  5:     9477    47385 0.00372546 (0.99928415)
  6:     1581     9486 0.00062150 (0.99990565)
  7:      215     1505 0.00008452 (0.99999017)
  8:       24      192 0.00000943 (0.99999961)
  9:        1        9 0.00000039 (1.00000000)
Tot:  2543846  3819498           (1.50147)
Cnt  2543846 Empty   937417 (0.36850) Collisions 247 RedDragon 7638996/7631537=1.000977
__________________
  • the fraction of empty slots is 0.36850 , which is about what it should be (1/e)
  • the fraction of slots with more than one item (chain-length > 1) is also about (1/e)
  • the fraction of slots with exactly 1 item is what is left :: 1 - (2/e)
  • the number of collisions seems a bit high, but with 2.5 M items on a 32bits hashvalue it is not exceptional.
like image 112
wildplasser Avatar answered Dec 11 '25 17:12

wildplasser



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!