
Help with optimizing C# function via C and/or Assembly

I have this C# method which I'm trying to optimize:

// assume arrays are same dimensions
private void DoSomething(int[] bigArray1, int[] bigArray2)
{
    int data1;
    byte A1, B1, C1, D1;
    int data2;
    byte A2, B2, C2, D2;
    for (int i = 0; i < bigArray1.Length; i++)
    {
        data1 = bigArray1[i];
        data2 = bigArray2[i];

        A1 = (byte)(data1 >> 0);
        B1 = (byte)(data1 >> 8);
        C1 = (byte)(data1 >> 16);
        D1 = (byte)(data1 >> 24);

        A2 = (byte)(data2 >> 0);
        B2 = (byte)(data2 >> 8);
        C2 = (byte)(data2 >> 16);
        D2 = (byte)(data2 >> 24);

        A1 = A1 > A2 ? A1 : A2;
        B1 = B1 > B2 ? B1 : B2;
        C1 = C1 > C2 ? C1 : C2;
        D1 = D1 > D2 ? D1 : D2;

        bigArray1[i] = (A1 << 0) | (B1 << 8) | (C1 << 16) | (D1 << 24); 
    }
}

The function basically compares two int arrays. For each pair of corresponding elements, it compares the four individual byte values and keeps the larger of each pair. The element in the first array is then assigned a new int value reassembled from the 4 winning byte values (irrespective of source).

I think I have optimized this method as much as possible in C# (probably I haven't, of course - suggestions on that score are welcome as well). My question is, is it worth it for me to move this method to an unmanaged C DLL? Would the resulting method execute faster (and how much faster), taking into account the overhead of marshalling my managed int arrays so they can be passed to the method?

If doing this would get me, say, a 10% speed improvement, then it would not be worth my time for sure. If it was 2 or 3 times faster, then I would probably have to do it.
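For concreteness, here is roughly what I imagine the unmanaged side would look like. The name and signature are just placeholders, not an existing API; the byte-for-byte loop works because a per-byte max produces the same packed int regardless of which byte is which:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unmanaged counterpart of DoSomething: treats each int as
   four bytes and keeps the larger byte from each pair, writing the
   result back into the first array. */
void byte_max_arrays(int32_t *a, const int32_t *b, size_t n)
{
    uint8_t *pa = (uint8_t *)a;
    const uint8_t *pb = (const uint8_t *)b;
    size_t bytes = n * sizeof(int32_t);
    for (size_t i = 0; i < bytes; i++) {
        if (pb[i] > pa[i])
            pa[i] = pb[i];  /* keep the larger of the two bytes */
    }
}
```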

Note: please, no "premature optimization" comments, thanks in advance. This is simply "optimization".

Update: I realized that my code sample didn't capture everything I'm trying to do in this function, so here is an updated version:

private void DoSomethingElse(int[] dest, int[] src, double pos,
    double srcMultiplier)
{
    int rdr;
    byte destA, destB, destC, destD;
    double rem = pos - Math.Floor(pos);
    double recipRem = 1.0 - rem;
    byte srcA1, srcA2, srcB1, srcB2, srcC1, srcC2, srcD1, srcD2;
    // stop one short of the end: src[i + 1] is read each iteration
    for (int i = 0; i < src.Length - 1; i++)
    {
        // get destination values
        rdr = dest[(int)pos + i];
        destA = (byte)(rdr >> 0);
        destB = (byte)(rdr >> 8);
        destC = (byte)(rdr >> 16);
        destD = (byte)(rdr >> 24);
        // get bracketing source values
        rdr = src[i];
        srcA1 = (byte)(rdr >> 0);
        srcB1 = (byte)(rdr >> 8);
        srcC1 = (byte)(rdr >> 16);
        srcD1 = (byte)(rdr >> 24);
        rdr = src[i + 1];
        srcA2 = (byte)(rdr >> 0);
        srcB2 = (byte)(rdr >> 8);
        srcC2 = (byte)(rdr >> 16);
        srcD2 = (byte)(rdr >> 24);
        // interpolate (simple linear) and multiply
        srcA1 = (byte)(((double)srcA1 * recipRem) +
            ((double)srcA2 * rem) * srcMultiplier);
        srcB1 = (byte)(((double)srcB1 * recipRem) +
            ((double)srcB2 * rem) * srcMultiplier);
        srcC1 = (byte)(((double)srcC1 * recipRem) +
            ((double)srcC2 * rem) * srcMultiplier);
        srcD1 = (byte)(((double)srcD1 * recipRem) +
            ((double)srcD2 * rem) * srcMultiplier);
        // bytewise best-of
        destA = srcA1 > destA ? srcA1 : destA;
        destB = srcB1 > destB ? srcB1 : destB;
        destC = srcC1 > destC ? srcC1 : destC;
        destD = srcD1 > destD ? srcD1 : destD;
        // convert bytes back to int, writing to the element we read
        dest[(int)pos + i] = (destA << 0) | (destB << 8) |
            (destC << 16) | (destD << 24);
    }
}

Essentially this does the same thing as the first method, except in this one the second array (src) is always smaller than the first (dest), and the second array is positioned fractionally relative to the first (meaning that instead of being positioned at, say, 10 relative to dest, it can be positioned at 10.682791).

To achieve this, I have to interpolate between two bracketing values in the source (say, 10 and 11 in the above example, for the first element) and then compare the interpolated bytes with the destination bytes.
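Isolated into a C helper for clarity, the per-byte interpolation step looks like this. It mirrors the formula in my C# above (the multiplier applies only to the second term, as written); the clamp to 255 is an addition my current code doesn't have, since the plain byte cast would otherwise wrap on overflow:

```c
#include <stdint.h>

/* Illustrative scalar helper: linearly interpolate one byte lane between
   the two bracketing source values, scaling the upper bracket. */
static uint8_t lerp_byte(uint8_t lo, uint8_t hi, double rem, double mult)
{
    double v = (double)lo * (1.0 - rem) + (double)hi * rem * mult;
    if (v > 255.0) v = 255.0;  /* avoid wrap-around on overflow */
    if (v < 0.0)   v = 0.0;
    return (uint8_t)v;
}
```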

I suspect that the multiplications in this function are substantially more costly than the byte comparisons, so that part may be a red herring (sorry). Also, even if the comparisons are still somewhat expensive relative to the multiplications, this system can actually be multi-dimensional: instead of comparing 1-dimensional arrays, the arrays could be 2-, 5-, or however-many-dimensional, so that eventually the time taken to calculate interpolated values would dwarf the time taken by the final bytewise comparison of 4 bytes (I'm assuming that's the case).

How expensive is the multiplication here relative to the bit-shifting, and is this the kind of operation that could be sped up by being offloaded to a C DLL (or even an assembly DLL, although I'd have to hire somebody to create that for me)?

asked May 30 '10 by MusiGenesis

4 Answers

Yes, the _mm_max_epu8() intrinsic does what you want. It chews through 16 bytes at a time. The pain point is the arrays. SSE2 instructions require their arguments to be aligned at 16-byte addresses, and you cannot get that from the garbage-collected heap, which only promises 4-byte alignment. Even if you trick it by calculating an offset into the array that's 16-byte aligned, you'll lose when the garbage collector kicks in and moves the array.

You'll have to declare the arrays in the C/C++ code, using the __declspec(align(#)) declarator. Now you need to copy your managed arrays into those unmanaged ones. And the results back. Whether you are still ahead depends on details not easily seen in your question.
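For illustration, the inner loop could look like this in C. The function name is made up, the scalar tail for leftover elements is omitted, and it uses _mm_loadu_si128 (MOVDQU), which tolerates unaligned pointers at some speed cost on older CPUs; with aligned buffers you would use _mm_load_si128 instead:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Sketch of the SSE2 inner loop: per-byte unsigned max of 16 bytes
   (four packed ints) per iteration, stored back into the first array. */
void byte_max_sse2(int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PMAXUB: unsigned byte-wise maximum of the two registers */
        _mm_storeu_si128((__m128i *)(a + i), _mm_max_epu8(va, vb));
    }
}
```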

answered Oct 11 '22 by Hans Passant


The function below uses unsafe code to treat the integer arrays as arrays of bytes so that there's no need for bit twiddling.

    private static void DoOtherThing(int[] bigArray1, int[] bigArray2)
    {
        unsafe
        {
            fixed (int* p1 = bigArray1, p2 = bigArray2)
            {
                byte* b1 = (byte*)p1;
                byte* b2 = (byte*)p2;
                byte* bend = (byte*)(&p1[bigArray1.Length]);
                while (b1 < bend)
                {
                    if (*b1 < *b2)
                    {
                        *b1 = *b2;
                    }
                    ++b1;
                    ++b2;
                }
            }
        }
    }

On my machine running under the debugger in Release mode against arrays of 25 million ints, this code is about 29% faster than your original. However, running standalone, there is almost no difference in runtime. Sometimes your original code is faster, and sometimes the new code is faster.

Approximate numbers:

          Debugger  Standalone
Original  1,400 ms    700 ms
My code     975 ms    700 ms

And, yes, I did compare the results to ensure that the functions do the same thing.

I'm at a loss to explain why my code isn't faster, since it's doing significantly less work.

Given these results, I doubt that you could improve things by going to native code. As you say, the overhead of marshaling the arrays would likely eat up any savings you might realize in the processing.

The following modification to your original code, though, is 10% to 20% faster.

    private static void DoSomething(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            var A1 = data1 & 0xff;
            var B1 = data1 & 0xff00;
            var C1 = data1 & 0xff0000;
            var D1 = data1 & 0xff000000;

            var A2 = data2 & 0xff;
            var B2 = data2 & 0xff00;
            var C2 = data2 & 0xff0000;
            var D2 = data2 & 0xff000000;

            if (A2 > A1) A1 = A2;
            if (B2 > B1) B1 = B2;
            if (C2 > C1) C1 = C2;
            if (D2 > D1) D1 = D2;

            bigArray1[i] = (int)(A1 | B1 | C1 | D1);
        }
    }

answered Oct 11 '22 by Jim Mischel


What about this?

    private void DoSomething(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            bigArray1[i] = (int)(
                Math.Max(data1 & 0x000000FF, data2 & 0x000000FF) |
                Math.Max(data1 & 0x0000FF00, data2 & 0x0000FF00) |
                Math.Max(data1 & 0x00FF0000, data2 & 0x00FF0000) |
                Math.Max(data1 & 0xFF000000, data2 & 0xFF000000));
        }
    }

It has a lot less bit shifting in it. If you profile it, you might find that the calls to Math.Max aren't being inlined; in that case, you'd just replace them with explicit comparisons and make the method more verbose.

I haven't tested this code as I don't have an IDE with me. I reckon it does what you want though.

If this still doesn't perform as you'd expect, you could try using pointer arithmetic in an unsafe block, but I seriously doubt that you'd see a gain. Code like this is unlikely to be faster if you extern to it, from everything I've read. But don't take my word for it. Measure, measure, measure.

Good luck.

answered Oct 11 '22 by Drew Noakes


I don't see any way of speeding up this code by means of clever bit tricks.

If you really want this code to be faster, the only way I see of significantly (>2x or so) speeding it up on the x86 platform is to go for an assembler/intrinsics implementation. SSE has the instruction PCMPGTB, whose documentation says:

"Performs a SIMD compare for the greater value of the packed bytes, words, or doublewords in the destination operand (first operand) and the source operand (second operand). If a data element in the destination operand is greater than the corresponding data element in the source operand, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s."

An XMM register fits four 32-bit ints, so you could loop over your arrays reading the values, getting the mask, and then ANDing the first input with the mask and the second with the inverted mask.
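Sketched with intrinsics, that mask-and-blend looks like the helper below. One caveat: PCMPGTB compares *signed* bytes, so both inputs are XORed with 0x80 first to turn it into an unsigned comparison. This is just an illustrative sketch, not tested against your code:

```c
#include <emmintrin.h>  /* SSE2 */

/* Per-byte unsigned max of two XMM registers via compare + blend. */
static __m128i byte_max_blend(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi8((char)0x80);
    /* mask byte = 0xFF where (unsigned) a > b, else 0x00 */
    __m128i mask = _mm_cmpgt_epi8(_mm_xor_si128(a, bias),
                                  _mm_xor_si128(b, bias));
    /* take a where the mask is set, b elsewhere */
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}
```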

On the other hand, maybe you can reformulate your algorithm so that you don't need to pick the larger bytes; maybe, for example, you could take the AND of the operands? Just a thought; it's hard to tell whether that would work without seeing the actual algorithm.

answered Oct 11 '22 by Krystian