Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Good hash function for list of integers where order doesn't change value

Tags:

c#

Given a set of integers, a "functional group", is there a better way to GetHashCode of the integers where the positions of the numbers doesn't affect the hash?

void Main()
{
    int[] ints = { 10001, 10002, 10003, 10004, 10005 };

    int hash = GetHashCode(ints);

    Console.WriteLine("hash={0}", hash);
}

int GetHashCode(IEnumerable<int> integers)
{
    IEnumerator<int> intEnum = integers.GetEnumerator();

    if(intEnum.MoveNext()==false) return 0;

    int hash = 0;
    unchecked {
        hash = intEnum.Current.GetHashCode();
        for(;intEnum.MoveNext()==true;)
            hash = 31 * hash + intEnum.Current.GetHashCode();
    }

    return hash;
}

Output of this is: hash=954101523 If I swap 10003 and 10002 i get: hash=954130353

Besides sorting the list before getting the hash, is there a better alternative that wont change if the items in the list positions change?

The list of integers basically represents a set of record Ids that are a "functional group", so the "functional group" is really the key, and not really dependent on the order

like image 249
enorl76 Avatar asked Feb 04 '15 16:02

enorl76


People also ask

What is a good hash function for integers?

A good hash function to use with integer key values is the mid-square method. The mid-square method squares the key value, and then takes out the middle r bits of the result, giving a value in the range 0 to 2r−1. This works well because most or all bits of the key value contribute to the result.

What makes a good hash function?

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.

Can hash values be the same?

At the same time, two keys can also generate an identical hash. This phenomenon is called a collision. A good hash function never produces the same hash value from two different inputs.

Can lists be hashed?

Companies upload a coded list of phone numbers or email addresses to optimize their ad delivery. These coded lists are called hashed lists.


2 Answers

Addition with a good one-value hash function

One good one-value hash function has a public domain implementation in C thanks to the Hash Function Prospector:

// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= UINT32_C(0xed5ad4bb);
    x ^= x >> 11;
    x *= UINT32_C(0xac4c1b51);
    x ^= x >> 15;
    x *= UINT32_C(0x31848bab);
    x ^= x >> 14;
    return x;
}

You would convert that to C#, apply that to each value, and then sum all the hashed results together. Addition perfectly satisfies your 'order doesn't matter' criterion since order doesn't matter with addition, you still get the same result. The one-value hash function above satisfies your desire for a decent hash function.

Implementation

The following implements the idea above (with a test rearrangement to show it gives the same hash value):

using System;
using System.Collections.Generic;

public class Test
{
    static void Main()
    {
        int[] ints = { 10001, 10002, 10003, 10004, 10005 };
        int hash = GetHashCode(ints);
        int[] reorderedInts = { 10004, 10002, 10005, 10001, 10003 };
        int reorderedHash = GetHashCode(reorderedInts);

        Console.WriteLine("hash          == {0}", hash);
        Console.WriteLine("hashReordered == {0}", reorderedHash);
    }

    static int GetHashCode(IEnumerable<int> integers)
    {
        int hash = 0;

        foreach(int integer in integers)
        {
            int x = integer;

            x ^= x >> 17;
            x *= 830770091;   // 0xed5ad4bb
            x ^= x >> 11;
            x *= -1404298415; // 0xac4c1b51
            x ^= x >> 15;
            x *= 830770091;   // 0x31848bab
            x ^= x >> 14;

            hash += x;
        }

        return hash;
    }
}

This produces the output:

hash          == -2145263134
hashReordered == -2145263134
like image 117
Chai T. Rex Avatar answered Oct 17 '22 23:10

Chai T. Rex


Instead of finding a permutation-invariant hash, I propose you first "un-permute" the list by finding a canonical permutation (e.g. sort the list first), then hashing that with whatever hash you desire.

Note that since this is integers we're talking about, you can use radix sort to do it in linear time.

like image 21
user541686 Avatar answered Oct 18 '22 01:10

user541686