Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

signed to positive near-perfect hash

I have an integer type, say long, whose values are between Long.MIN_VALUE = 0x80...0 (-2^63) and Long.MAX_VALUE = 0x7f...f (2^63 - 1). I want to hash it with ~50% collision to a positive integer of the same type (i.e. between 1 and Long.MAX_VALUE) in a clean and efficient manner.

My first attempts were something like:

  • Math.abs(x) + 1
  • (x & Long.MAX_VALUE) + 1

but those and similar approaches always have problems with certain values, i.e. when x is 0 / Long.MIN_VALUE / Long.MAX_VALUE. Of course, the naive solution is to use 2 if statements, but I'm looking for something cleaner / shorter / faster. Any ideas?

Note: Assume that I'm working in Java where there is no implicit conversion to boolean and shift semantics is defined.

like image 805
eold Avatar asked Jul 11 '12 05:07

eold


2 Answers

The simplest approach is to zero the sign bit and then map zero to some other value:

Long y = x & Long.MAX_VALUE;
return (y == 0)? 42: y;

This is simple, uses only one if/ternary operator, and gives ~50% collision rate on average. There is one disadvantage: it maps 4 different values (0, 42, MIN_VALUE, MIN_VALUE+42) to one value (42). So for this value we have 75% collisions, while for other values - exactly 50%.

It may be preferable to distribute collisions more evenly:

return (x == 0)? 42: (x == Long.MIN_VALUE) ? 142: x & Long.MAX_VALUE;

This code gives 67% collisions for 2 values and 50% for other values. You cannot distribute collisions more evenly, but it is possible to choose these 2 most colliding values. Disadvantage is that this code uses two ifs/ternary operators.

It is possible to avoid 75% collisions on single value while using only one if/ternary operator:

Long y = x & Long.MAX_VALUE;
return (y == 0)? 42 - (x >> 7): y;

This code gives 67% collisions for 2 values and 50% collisions for other values. There is less freedom choosing these most colliding values: 0 maps to 42 (and you can choose almost any value instead); MIN_VALUE maps to 42 - (MIN_VALUE >> 7) (and you can shift MIN_VALUE by any value from 1 to 63, only make sure that A - (MIN_VALUE >> B) does not overflow).


It is possible to get the same result (67% collisions for 2 values and 50% collisions for other values) without conditional operators (but with more complicated code):

Long y = x - 1 - ((x >> 63) << 1);
Long z = y + 1 + (y >> 63);
return z & Long.MAX_VALUE;

This gives 67% collisions for values '1' and 'MAX_VALUE'. If it is more convenient to get most collisions for some other values, just apply this algorithm to x + A, where 'A' is any number.

An improved variant of this solution:

Long y = x + 1 + ((x >> 63) << 1);
Long z = y - (y >> 63);
return z & Long.MAX_VALUE;
like image 100
Evgeny Kluev Avatar answered Sep 18 '22 07:09

Evgeny Kluev


Assuming you want to collapse all values into the positive space, why not just zero the sign bit?

You can do this with a single bitwise op by taking advantage of the fact that MAX_VALUE is just a zero sign bit followed by ones e.g.

int positive = value & Integer.MAX_VALUE;

Or for longs:

long positive = value & Long.MAX_VALUE;

If you want a "better" hash with pseudo-random qualities, you probably want to pss the value through another hash function first. My favourite fast hashes are the XORshift family by George Marsaglia. These have the nice property that they map the entire int / long number space perfectly onto itself, so you will still get exactly 50% collisions after zeroing the sign bit.

Here's a quick XORshift implementation in Java:

public static final long xorShift64(long a) {
    a ^= (a << 21);
    a ^= (a >>> 35);
    a ^= (a << 4);
    return a;
}

public static final int xorShift32(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
like image 28
mikera Avatar answered Sep 22 '22 07:09

mikera