According to http://www.codeguru.com/forum/showthread.php?t=463663, C#'s String.GetHashCode in .NET 3.5 is implemented as:
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
I am curious whether anyone can come up with a function that returns the same results but is faster. It is OK to increase the overall startup and resource overhead of the main application; requiring a one-time initialization (per application execution, not per call or per string) is fine.
Note that, unlike Microsoft, I can ignore considerations like "doing it this way will make everything else slower and has costs that make this method stupid!" So even if Microsoft's implementation is perfect for its constraints, it may still be beatable by doing something "stupid."
This is purely an exercise in my own curiosity and will not be used in real code.
Examples of ideas I've thought of:
One way to make a function go faster is to take special cases into account. A function with variable-size inputs has special cases based on size.
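For instance, the question explicitly allows a one-time initialization per application run, so the most common short sizes could be served from a precomputed table. A minimal sketch of that idea (the class and method names are made up; the table is filled by calling the built-in GetHashCode itself, so its entries match by construction, though runtimes with randomized string hashing would make the values process-specific):

using System;

static class ShortStringHash
{
    // One-time cost: 64K entries built by calling the built-in hash itself,
    // so single-character lookups agree with String.GetHashCode by construction.
    static readonly int[] OneCharHash = BuildOneCharTable();

    static int[] BuildOneCharTable()
    {
        var table = new int[char.MaxValue + 1];
        for (int c = 0; c <= char.MaxValue; c++)
            table[c] = ((char)c).ToString().GetHashCode();
        return table;
    }

    public static int Hash(string s)
    {
        if (s.Length == 0) return string.Empty.GetHashCode(); // special case: empty
        if (s.Length == 1) return OneCharHash[s[0]];           // special case: one char
        return s.GetHashCode();                                // general case
    }
}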
Going parallel only makes sense when the cost of going parallel is smaller than the gain, and for this kind of computation the string would likely have to be fairly large to overcome the cost of forking a parallel thread. But implementing that isn't hard: basically you test whether this.Length exceeds an empirically determined threshold, and if so fork multiple threads to compute hashes on substrings, with a final step composing the subhashes into a final hash. Implementation left for the reader (a rough sketch of the structure follows).
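A rough sketch of that threshold-and-fork structure, using Task for the fork. The threshold constant is a placeholder, and the final combining step is an arbitrary mix, so this does not reproduce String.GetHashCode's exact value; it only illustrates the shape of the approach:

using System;
using System.Threading.Tasks;

static class ParallelHashSketch
{
    // Placeholder; the real value would be determined empirically.
    const int ParallelThreshold = 8 * 1024;

    public static int Hash(string s)
    {
        if (s.Length < ParallelThreshold)
            return s.GetHashCode();   // too small to be worth forking a thread

        int half = s.Length / 2;
        // Hash the two halves concurrently. Substring copies the data,
        // which a serious implementation would avoid.
        Task<int> left = Task.Run(() => s.Substring(0, half).GetHashCode());
        int right = s.Substring(half).GetHashCode(); // current thread takes the other half

        // Arbitrary composition of the sub-hashes; NOT the built-in value.
        return unchecked((left.Result * 0x5d588b65) ^ right);
    }
}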
Modern processors also have SIMD instructions, which can process up to 32 (or 64) bytes in a single instruction. This would allow you to process the string in chunks of 32 (16-bit) characters with one or two SIMD instructions per chunk, and then fold the 64-byte accumulator into a single hash code at the end. This is likely to be extremely fast for strings of any reasonable size. The implementation from C# is harder, because one doesn't expect a virtual machine to provide easy (or portable) access to the SIMD instructions you need. Implementation also left for the reader. EDIT: Another answer points out that Mono does provide SIMD instruction access.
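If you want to experiment from managed code, one option on newer runtimes is System.Numerics.Vector<T>, which the JIT lowers to SSE/AVX where it can. A rough lane-parallel sketch on that basis; each lane keeps its own accumulator using a simplified multiply/xor step rather than the exact rotate-style step above, so the result does not match String.GetHashCode:

using System;
using System.Numerics;

static class SimdHashSketch
{
    // Lane-parallel hashing with Vector<int>: each lane accumulates its own
    // running hash, and the lanes are folded together at the end. The per-lane
    // step is acc = acc * 33 ^ c, a simplified relative of the original
    // (x << 5) + x step.
    public static int Hash(string s)
    {
        int lanes = Vector<int>.Count;          // e.g. 4 with SSE2, 8 with AVX2
        var acc = new Vector<int>(0x15051505);
        var chunk = new int[lanes];

        int i = 0;
        for (; i + lanes <= s.Length; i += lanes)
        {
            for (int j = 0; j < lanes; j++)
                chunk[j] = s[i + j];            // widen chars into the int lanes
            acc = (acc * 33) ^ new Vector<int>(chunk);
        }

        // Fold the lanes, then finish any leftover characters sequentially.
        int hash = 0;
        for (int j = 0; j < lanes; j++)
            hash = unchecked(hash * 31 ^ acc[j]);
        for (; i < s.Length; i++)
            hash = unchecked(hash * 33 ^ s[i]);
        return hash;
    }
}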
Having said that, the particular implementation exhibited is pretty stupid. The key observation is that the loop checks the limit twice on every iteration. One can solve that by checking the end-condition cases in advance and executing a loop that does the correct number of iterations. One can do better still by using Duff's device to jump into an unrolled loop of N iterations, which gets rid of the loop-limit check for N-1 of every N iterations. That modification would be very easy and surely worth the effort to implement.
EDIT: You can also combine the SIMD idea and the loop-unrolling idea to process many chunks of 8/16 characters in a few SIMD instructions.
For languages that can't jump into the middle of loops, one can get the equivalent of Duff's device by simply peeling off the initial cases. Here is a shot at recoding the original using the loop-peeling approach:
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        const int N = 3; // a power of two controlling the number of loop iterations
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        int count = this.Length;
        int unrolled_iterations = count >> (N + 1); // could be 0 and that's OK
        for (int i = unrolled_iterations; i > 0; i--)
        {
            // repeat 2**N times
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[8];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[9]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[10];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[11]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[12];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[13]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[14];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[15]; }
            numPtr += 16;
        }
        if (count & ((1 << N) - 1))
        {
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7]; }
            numPtr += 8;
        }
        if (count & ((1 << (N - 1)) - 1))
        {
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
            numPtr += 4;
        }
        if (count & ((1 << (N - 2)) - 1))
        {
            { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
              num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
            numPtr += 2;
        }
        // repeat N times and finally:
        if (count & 1)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            // numPtr += 1;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
I haven't compiled or tested this code, but the idea is right. It depends on the compiler doing reasonable constant folding and address arithmetic.
I tried to code this to preserve the exact hash value of the original, but IMHO that isn't really a requirement. It would be even simpler and a tiny bit faster if it didn't use the num/num2 stunt, but simply updated num for each character.
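For illustration, that simpler single-accumulator variant would look roughly like this (it is not compatible with String.GetHashCode's values, since it drops the num/num2 interleaving and the final combining multiply):

// Simplified single-accumulator hash: the same shift/add/xor step applied
// once per character.
public static int SimpleHash(string s)
{
    int num = 0x15051505;
    for (int i = 0; i < s.Length; i++)
        num = (((num << 5) + num) + (num >> 0x1b)) ^ s[i];
    return num;
}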
Corrected version (by Brian) as a static function:
public static unsafe int GetHashCodeIra(string x)
{
fixed (char* str = x) // pin the string itself: no copy, and the null terminator keeps the final read in bounds for odd lengths
{
const int N = 2; // a power of two controlling number of loop iterations
char* chPtr = str;
int num = 0x15051505;
int num2 = num;
int* numPtr = (int*)chPtr;
int count = (x.Length+1) / 2;
int unrolled_iterations = count >> (N+1); // could be 0 and that's OK
for (int i = unrolled_iterations; i > 0; i--)
{ // repeat 2**N times
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
}
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3];
}
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5];
}
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7];
}
numPtr += 8;
}
if (0 != (count & ((1 << N) )))
{
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
}
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3];
}
numPtr += 4;
}
if (0 != (count & ((1 << (N - 1) ))))
{
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
}
numPtr += 2;
}
// repeat N times and finally:
if (1 == (count & 1))
{
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
// numPtr += 1;
}
}
return (num + (num2 * 0x5d588b65));
}
}
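A quick way to sanity-check the rewrite is to compare it against the built-in hash and time it. A minimal harness along these lines, placed in the same class as GetHashCodeIra and assuming the usual using System; at the top of the file (the sample strings and iteration count are arbitrary; on runtimes with randomized string hashing the built-in values will not match the 3.5 algorithm above):

static void Main()
{
    // Compare against the built-in hash on a few shapes of input.
    string[] samples = { "", "a", "ab", "abc", "hello world", new string('x', 10000) };
    foreach (string s in samples)
        Console.WriteLine("len={0,6}  builtin={1,12}  unrolled={2,12}",
            s.Length, s.GetHashCode(), GetHashCodeIra(s));

    // Rough timing of the unrolled version on a long string.
    var sw = System.Diagnostics.Stopwatch.StartNew();
    int sink = 0;
    for (int i = 0; i < 1000000; i++)
        sink ^= GetHashCodeIra(samples[5]);
    sw.Stop();
    Console.WriteLine("1M hashes of a 10k-char string: {0} ms (sink={1})",
        sw.ElapsedMilliseconds, sink);
}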
Threads and GPU will almost certainly introduce overhead greater than any possible performance gain. The one approach that could be justified is using a SIMD instruction set such as SSE. However, that requires testing whether the particular instruction set is available, which has its own cost, and it will only yield a boost on long strings.
If you want to try it, consider testing Mono's SIMD support before diving into C or assembly, and read up on its development possibilities and gotchas.