Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hash integer array

I am using a hash set in which I store array of integers(32 bits). This means I need an algorithm to hash an array of integers. I am looking for a 32 bits integer(C# int) hash.

I have tried and edited two existing algorithms, which you can see the four versions of at the bottom, including their benchmark.

My questions are as follows:

1. Do you think the bottom algorithm is good for this purpose?

2. Is there a better algorithm available for this purpose?

Program information

  • Typically an array has 16 entries, and the integers are smaller than 10, although both must support larger values. I could say the largest values that have any chance on occurring are 200 entries and integers of value 20.
  • I am using the HashSet in a breath-first search algorithm, to compare if two nodes are the same. http://en.wikipedia.org/wiki/Breadth-first_search.
  • For this specific program I am unable to use unsafe code.

Benchmarks and code

below are my benchmarks and code, from worst to best performance in my program.

  • Coordinates2D is a struct containing an int x and an int y.
  • The total entries in the HashSet at run end is 356525
  • I could not retrieve the number of collisions exactly. The number given is the number of times an object was actually compared and not equal(same hash, different object). This happens multiple times between the same objects though. This value varies a bit per execution, as the program is multithreaded.
  • MurMurHash3 seed is const uint seed = 144

MurMurHash3 using bytes retrieved from the coordinates directly

Code is equal to https://gist.github.com/automatonic/3725443 The array of bytes is retrieved using the following code:

int size = Marshal.SizeOf(typeof(Coordinates2D));
int length = carCoords.Length;
Byte[] bytes = new Byte[size * length];
for (int i = 0; i < length; ++i)
{
    GCHandle pinStructure = GCHandle.Alloc(carCoords[i], GCHandleType.Pinned);
    Marshal.Copy(pinStructure.AddrOfPinnedObject(), bytes, i*size, size);
    pinStructure.Free();
}

// Hash the byte array
return MurMurHash3.Hash(new System.IO.MemoryStream(bytes));

This is incredibly inefficient, because of the copying.

  • Performance: 40880ms
  • Collisions: < 84

MurMurHash3 using bytes retrieved from the integers in the objects

public static int Hash2(RushHourPathLengthNode.Coordinates2D[] coords)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;

    uint h1 = seed;
    uint k1 = 0;
    uint streamLength = (uint)coords.Length * 2;

    for (int i = 0, l = coords.Length; i < l; ++i)
    {
        // Do it for X
        byte[] chunk = BitConverter.GetBytes(coords[i].x);

        /* Get four bytes from the input into an uint */
        k1 = (uint)
           (chunk[0]
          | chunk[1] << 8
          | chunk[2] << 16
          | chunk[3] << 24);

        /* bitmagic hash */
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;


        // Do it for y
        chunk = BitConverter.GetBytes(coords[i].y);

        /* Get four bytes from the input into an uint */
        k1 = (uint)
           (chunk[0]
          | chunk[1] << 8
          | chunk[2] << 16
          | chunk[3] << 24);

        /* bitmagic hash */
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // finalization, magic chants to wrap it all up
    h1 ^= streamLength;
    h1 = fmix(h1);

    unchecked //ignore overflow
    {
        return (int)h1;
    }
}

This is alot more efficient now the copying is gone.

  • Performance: 16640ms
  • Collisions: < 92

MurMurHash3 using integers

public static int Hash(RushHourPathLengthNode.Coordinates2D[] coords)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;

    uint h1 = seed;
    uint k1 = 0;
    uint streamLength = (uint)coords.Length * 2;

    for (int i = 0, l = coords.Length; i < l; ++i)
    {
        k1 = (uint)coords[i].x;

        //bitmagic hash
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;

        k1 = (uint)coords[i].y;

        //bitmagic hash
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // finalization, magic chants to wrap it all up
    h1 ^= streamLength;
    h1 = fmix(h1);

    unchecked //ignore overflow
    {
        return (int)h1;
    }
}
  • Performance: 13027ms
  • Collisions: < 95

Hash using integer addition multiplication

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash * 31 + carCoords[i].x;
    hash = hash * 31 + carCoords[i].y;
}
return hash;
  • Performance: 4564ms
  • Collisions: < 44

As you see, this one is far more efficient. It works well with any prime numbers. As I understand, there is no scientific proof of this to work, which I am not too fond of.

According to Michal B. a faster version would be using bitshifting. However, testing shows that this is not a successful hash. The problem takes significantly longer to run(It did not finish within 5 minutes). The bitshifting might be good, but it seems like the 31(prime number) is crucial.

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash << 5 - carCoords[i].x;
    hash = hash << 5 - carCoords[i].y;
}
return hash;
like image 1000
Aart Stuurman Avatar asked Nov 08 '13 08:11

Aart Stuurman


2 Answers

In the end I went with the last algorithm.

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash * 19 + carCoords[i].x;
    hash = hash * 19 + carCoords[i].y;
}
return hash;

This is very fast to compute, and for the (small) numbers I am using the hash is awesome.

If you are going to use this, make sure the numbers you use are prime numbers. Because of this you cannot use bitshifting to optimize it.

like image 121
Aart Stuurman Avatar answered Nov 09 '22 21:11

Aart Stuurman


Have you considered using a space-filling curve to generate the hash? This will minimize (or eliminate) collisions for the chosen resolution (maxX, maxY)

Here are two SO questions and their answers that use this method.

  1. Mapping N-dimensional value to a point on Hilbert curve
  2. Calculate the Hilbert value of a point for use in a Hilbert R-Tree?

Hope this helps!

like image 32
Ani Avatar answered Nov 09 '22 19:11

Ani