I have a hashtable that stores quadtree entries.
The hashfunction looks like this:
Quadtree hash
#define node_hash(a,b,c,d) \
(((int)(d))+3*(((int)(c))+3*(((int)(b))+3*((int)(a))+3)))
Note that the result of this operation is always chunked down using a modulus prime number like so:
h = node_hash(p->nw, p->ne, p->sw, p->se) ;
h %= hashprime ;
...
Comparison with optimal hash
Some statistical analysis shows that this hash is optimum in terms of collision reduction.
Given a hashtable with b
buckets and n
entries. The collision risk using a perfect hash is:(n - b * (1 - power((b-1)/b,n)))) * 100 / n
When n = b this means a collision risk of 37%.
Some testing shows that the above hash lines up very nicely with the norm (for all fill levels of the hashtable).
Running time
The runtime is heavily dependent on the value of hashprime
Timings (best out of 1000 runs) are:
hashprime CPU-cycles per run
--------------------------------
4049 56
16217 68
64871 127 <-- whoooh
Is there a way to improve on this, whilst still retaining the optimum collision risk?
Either by optimizing the modulus operation (replacing it with a multiplication using 'magic' numbers computer outside the loop).
Replacing the hash function with some other hash function.
Background
The following assembly is produced:
//--------h = node_hash(p->nw, p->ne, p->sw, p->se) ;
mov eax,[rcx+node.nw] <<+
lea eax,[eax+eax*2+3] |
add eax,[rcx+node.ne] |
lea eax,[eax+eax*2] +- takes +/- 12 cycles
add eax,[rcx+node.sw] |
lea eax,[eax+eax*2] |
add eax,[rcx+node.se] <<+
//--------h %= hashprime ;
mov esi,[hashprime]
xor edx,edx
div esi
mov rax,rdx <<--- takes all the rest
[EDIT]
I may be able to do something with the fact that:
C = A % B
is equivalent to C = A – B * (A / B)
Using the fact that integer division is the same as multiplication by its reciprocal.
Thus converting the formula to C = A - B * (A * rB)
Note that for integer division the reciprocals are magic numbers, see: http://www.hackersdelight.org/magic.htm
C code is here: http://web.archive.org/web/20070713211039/http://hackersdelight.org/HDcode/magic.c
[FNV hashes]
See: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-1a
hash = offset_basis
for each byte to be hashed
hash = hash xor octet_of_data
hash = hash * FNV_prime (for 32 bits = 16777619)
return hash
For 4 pointers truncated to 32 bits = 16 bytes the FNV hash takes 27 cycles (hand crafted assembly)
Unfortunately this leads to hash collisions of 81% where it should be 37%.
Running the full 15 multiplications takes 179 cycles.
Minimizing Collisions One of the main things you want to avoid in a hashed collection is collisions. This is when two or more keys map to the same bucket. These collisions mean you have to do more work to check whether the key is the one you expected, as there are now multiple keys in the same bucket.
In general, a hash function should depend on every single bit of the key, so that two keys that differ in only one bit or one group of bits (regardless of whether the group is at the beginning, end, or middle of the key or present throughout the key) hash into different values.
Hashing as we know it is used for performance improvement, error checking, and authentication. One example of a performance improvement is the common hash table, which uses a hash function to index into the correct bucket in the hash table, followed by comparing each element in the bucket to find a match.
Separate chaining If two records are being directed to the same cell, both would go into that cell as a linked list. This efficiently prevents a hash collision from occurring since records with the same hash values can go into the same cell, but it has its disadvantages.
Replacing modulus by reciprocal multiplication
The main cycle eater in this hash function is the modulus operator.
If you replace this division with a multiplication by the reciprocal the calculation is much faster.
Note that calculating the reciprocal involves 3 divides, so this should only be done when the reciprocal can be reused enough times.
OK, here's the code used: http://www.agner.org/optimize/asmlib.zip
From: http://www.agner.org/optimize/
// ;************************* divfixedi64.asm *********************************
// ; Author: Agner Fog
//extern "C" void setdivisoru32(uint Buffer[2], uint d)
asm
mov r8d, edx // x
mov r9, rcx // Buffer
dec r8d // r8d = r8d or esi
mov ecx, -1 // value for bsr if r8d = 0
bsr ecx, r8d // floor(log2(d-1))
inc r8d
inc ecx // L = ceil(log2(d))
mov edx, 1
shl rdx, cl // 2^L (64 bit shift because cl may be 32)
sub edx, r8d
xor eax, eax
div r8d
inc eax
mov [r9], eax // multiplier
sub ecx, 1
setae dl
movzx edx, dl // shift1
seta al
neg al
and al,cl
movzx eax, al // shift 2
shl eax, 8
or eax, edx
mov [r9+4], eax // shift 1 and shift 2
ret
end;
and the code for the modulus operation:
//extern "C" uint modFixedU32(uint Buffer[2], uint d)
asm
mov eax, edx
mov r10d, edx // x
mov r11d, edx // save x
mul dword [rcx] // Buffer (i.e.: m')
sub r10d, edx // x-t
mov ecx, [rcx+4] // shift 1 and shift 2
shr r10d, cl
lea eax, [r10+rdx]
mov cl, ch
shr eax, cl
// Result:= x - m * fastDiv32.dividefixedu32(Buffer, x);
mul r8d // m * ...
sub r11d, eax // x - (m * ...)
mov eax,r11d
ret
end;
The difference in time is as follows:
hashprime classic hash (mod) new hash new old
(# of runs) cycles/run per run (no cache) (no cache)
--------------------------------------------------------------------
4049 56 21 16.6 51
16217 68 not measured
64871 127 89 16.5 50
Cache issues
The increase in cycle time is caused by the data overflowing the cache, causing main memory to be accessed.
This can be seen clearly when I remove cache effects by hashing the same value over and over.
Something like this might be useful:
static inline unsigned int hash4(unsigned int a, unsigned int b,
unsigned int c, unsigned int d) {
unsigned long long foo = 123456789*(long long)a ^ 243956871*(long long)b
^ 918273645*(long long)c ^ 347562981*(long long)d;
return (unsigned int)(foo >> 32);
}
Replace the four odd numbers I typed in with randomly generated 64-bit odd numbers; the ones above won't work that great. (64-bit so that the high 32 bits are somehow a random mix of the lower bits.) This is about as fast as the code you gave, but it lets you use power-of-two table sizes instead of prime table sizes without fear.
The thing everyone uses for similar workloads is the FNV hash. I'm not sure whether FNV actually has better properties than hashes of the type above, but it's similarly fast and it's in rather widespread use.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With