So, I am looking at different hash functions to use for hashing a 4 tuple ip and port to identify flows.
One I came across was
((size_t)(key.src.s_addr) * 59) ^
((size_t)(key.dst.s_addr)) ^
((size_t)(key.sport) << 16) ^
((size_t)(key.dport)) ^
((size_t)(key.proto));
Now for the life of me, I cannot explain the prime used (59). why not 31, and then why go mess it up by multiplying the sport by a power of 2. Is there a better hash function to be used for ip addresses ?
Personally I think you'd be better off reading the four IP bytes as an unsigned long which would give you a number roughly in the range 0 - 2^32-1. Then you figure out how many flows you want to have active at any one time and that would be your index table size.
Take 2000 for example. That means you want to map 2^32 numbers onto roughly 2^11 indeces (to flow information). That won't work because hashing almost never works if filled to 100% and even 90% can be difficult. Using a index table that you only fill to 50% (4000 indeces) or even 25% (8000) is no big deal with todays memories.
The exact size of the index table should be an uneven number of locations and preferably a prime number. This is because you'll most likely need to have some overflow handling to handle collisions (two or more ip numbers which after the hashing point to the same location in the index table) - which you'll get. The overflow handling should be another prime number less than the index table size. All these prime numbers! What's with them anyway?
I'll illustrate with an example (in C):
idxsiz = prime(2000 * 2); // 50% loading
ovfjmp = prime(idxsiz/3);
...
initially fill the table of idxjmp positions with an UNUSED marking (-1). Have a DELETED marking ready (-2).
Your ip number enters the system and you look for its flow record (may or may not exist):
stoppos = ip % idxsiz; /* modulo (division) just this once */
i = stoppos;
do
{
if (index[i] == UNUSED) return NULL;
if (index[i] != DELETED)
{
flowrecptr = &flow_record[index[i]];
if (!flowrecptr->in_use) {/* hash table is broken */}
if (flowrecptr->ip == ip) return flowrecptr;
}
i += ovfjmp;
if (i >= idxsiz) i -= idxsiz;
}
while (i != stoppos);
return NULL;
The UNUSED serves as a marker that this index has never been used and that searching should stop. The DELETED serves as a marker that this index has been used but no longer. That means that searching must continue.
This was when you were attempting to do a get. You got a NULL back from get so you need to do a put which you begin by finding the first index position containing UNUSED or DELETED. Replace this value with an index to the first/next free row on the flow_record table. Mark the row as in_use. Put the original ip number into the ip member of the flow_record row.
This is a very simple - but very effective - way to construct a hashing mechanism. Practically every optimization in the form of special functions to be used after this or that function has failed will enhance the effectiveness of the hashing.
The use of prime numbers will ensure that - in the worst case where all index locations are occupied - the mechanism will test every single location. To illustrate this: suppose idxsiz is evenly divisible by ovfjmp: you won't have much overflow handling to speak of. 35 and 7 will result in locations 0,7,14,21 and 28 being tested before the index jumps to 0 where the while test will cause the search to stop.
----------------------OOPS!
I missed that you wanted the port number as well. Assuming ip V4 that means 6 bytes of address. Read this as an unsigned 64-bit integer and clear the top 16 bits/2 bytes. Then you do the modulo calculation.
Brian Gideon pretty much sums it up; the multiplication and the shift are intended as a symmetry breaker. So this catches the hypothetical case of machine A telnetting to machine B and vice versa and they happen to chose the same ephemeral portnum. Not very common, but not impossible. Much of the 5-tuple is pretty constant: protocol comes from a very small domain, and so does one half of {address,portnum}.
Assuming a prime-sized hashtable, the magic constant 59 could be replaced by any prime, IMO. The (port << 16) could also be replaced by another shift, as long as no bits fall off or even by a (port * some_other_prime) term.
For a power-of-two sized hashtable, all(minus one) members of the 5-tuple should be multiplied by a (different) prime. (in the old days, when division was expensive that would have been an option)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With