Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.

h(abcdef) = h(bcdefa) = h(cdefab) etc

Uses for this include scalable algorithms that check n strings against each other to find which ones are rotations of one another.
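For context, the direct pairwise test that a circular hash would avoid repeating could be sketched as follows. This is only an illustration: the class and method names are hypothetical, and it relies on the well-known observation that every rotation of s appears as a substring of s + s.

```csharp
using System;

static class RotationCheck
{
    // t is a rotation of s exactly when both have the same length
    // and t occurs somewhere inside s + s.
    public static bool IsRotation(string s, string t)
    {
        return s.Length == t.Length
            && (s + s).IndexOf(t, StringComparison.Ordinal) >= 0;
    }
}
```

Checking n strings against each other this way takes on the order of n^2 such tests, whereas a rotation-invariant hash would let each string be processed once and then bucketed by hash value.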

I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?

It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...

Phil H asked Apr 06 '10 13:04


2 Answers

I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".

So:

"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
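A minimal sketch of this idea in C#, assuming ordinal character order as the meaning of "least" and using the built-in string hash as the "normal" hash. The names RotationHash, Canonical and Hash are made up for illustration. It compares every rotation directly, which is O(n^2) in the worst case; a linear-time least-rotation algorithm such as Booth's could be swapped in.

```csharp
using System;

static class RotationHash
{
    // Returns the lexicographically smallest rotation of s (ordinal comparison).
    public static string Canonical(string s)
    {
        if (s.Length < 2) return s;

        string doubled = s + s;   // every rotation is a length-n window of s + s
        int best = 0;

        for (int i = 1; i < s.Length; i++)
        {
            // Comparing whole rotations performs the tie-breaking implicitly.
            // Equal rotations keep the earlier start; both yield the same text.
            if (string.CompareOrdinal(doubled, i, doubled, best, s.Length) < 0)
            {
                best = i;
            }
        }

        return doubled.Substring(best, s.Length);
    }

    // Hashes the canonical rotation, so all rotations of a string share one value.
    // Note: string.GetHashCode() can differ between runs on modern .NET, so
    // substitute a stable hash if the value has to be stored.
    public static int Hash(string s)
    {
        return Canonical(s).GetHashCode();
    }
}
```

With this sketch, Canonical("defabc") gives "abcdef", Canonical("abaac") gives "aacab", and Canonical("cabcab") gives "abcabc", matching the examples above.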
Jon Skeet answered Sep 25 '22 08:09


Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise when the same pair of letters occurs more than once: identical terms cancel under XOR, so each duplicated pair contributes 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.

I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.

static int Hash(string s)
{
    int H = 0;

    if (s.Length > 0)
    {
        //any arbitrary coprime numbers (consecutive integers are always coprime)
        int a = s.Length, b = s.Length + 1;

        //an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        //0x100 entries so that every character in the range 0x00 to 0xFF has a slot
        int[] c = new int[0x100];

        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }

        //each call advances the sequence for x and returns the new value
        //(int arithmetic wraps silently on overflow in the default unchecked context)
        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();

        //for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);

        //for i=1...n-1 we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}

The output is now:

7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71

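First Version (which isn't complete): Use XOR, which is commutative and associative (so the order in which the terms are combined doesn't matter), together with a pair of coprime multipliers that make the hash of each adjacent pair of letters depend on their order. XOR-ing the pair hashes, including the wrap-around pair, gives the same result whichever character the string starts on. Here is an example in C#: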

static int Hash(char[] s)
{
    //any arbitrary coprime numbers
    const int a = 7, b = 13;

    int H = 0;

    if (s.Length > 0)
    {
        //for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());

        //for i=1...n-1 we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}

The output is:

4587590
4587590
4587590
7077996
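To make the repetition problem in this first version concrete, here is a small sketch that reproduces the same pair-XOR scheme (the class name FirstVersionDemo is hypothetical, and the loop is folded into a single modulo-indexed pass purely for brevity). When every adjacent pair occurs an even number of times, the terms cancel and the hash collapses to 0.

```csharp
using System;

static class FirstVersionDemo
{
    // Same scheme as the first version above: XOR of (a * prev + b * cur)
    // over every cyclically adjacent pair of characters.
    public static int Hash(string s)
    {
        const int a = 7, b = 13;
        int h = 0;

        for (int i = 0; i < s.Length; i++)
        {
            // wraps around to the last character when i == 0
            char prev = s[(i + s.Length - 1) % s.Length];
            h ^= (a * prev.GetHashCode()) + (b * s[i].GetHashCode());
        }

        return h;
    }
}
```

With this sketch, Hash("abab") and Hash("cdcd") both evaluate to 0 although neither string is a rotation of the other, because each pair term appears twice and cancels itself under XOR.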
Michael Petito answered Sep 25 '22 08:09