Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SSE code for sum bytes. Where's the bug?

I wrote SSE code for summing up byte values. (VS2005.)

As it is simple enough it works quite well (and fast). Only there are crashes with some sizes of the array. And it crashes only in release mode - in debug never. Maybe someone sees the "obvious" bug? Any help appreciated.

__int64 Sum (const unsigned char* pData, const unsigned int& nLength)
{
    __int64 nSum (0);

    __m128i* pp = (__m128i*)pData;

    ATLASSERT( ( (DWORD)pp & 15 ) == 0 ); // pointer must point to address multiple of 16 (cache line)

    __m128i zero = _mm_setzero_si128(),
        a, b, c, d, tmp;

    unsigned int i (0);

    for ( ; i < nLength; i+=64) // 4-fach loop-unroll (x 16)
    {
        a = _mm_sad_epu8( *(pp++), zero);           
        b = _mm_sad_epu8( *(pp++), zero);  // It crashes here.
        c = _mm_sad_epu8( *(pp++), zero);
        d = _mm_sad_epu8( *(pp++), zero);

        // commenting the following line prevents the crash (???)
        tmp = _mm_add_epi64( _mm_add_epi64( _mm_add_epi64( a, b ), c ), d);

        a = _mm_srli_si128 ( tmp, 8 );

        nSum += _mm_cvtsi128_si32( a ) + _mm_cvtsi128_si32( tmp );
    }

    // ... the rest
    if (nLength % 64)
        for (i -= 64; i < nLength; i++)
            nSum += pData [i];

    return nSum;
}

The function is called like this:

unsigned int nLength = 3571653;  // One of the values that causes crash
unsigned char *pData = (unsigned char*) _aligned_malloc(nLength, 16);
Sum (pData,  nLength);
like image 860
User42 Avatar asked Mar 19 '13 21:03

User42


2 Answers

Your for loop needs to be defined as follows:

for ( ; i < (nLength - 63); i+=64)

Basically imagine you pass in an array with nLength 120. You are fine on the first run through. i is now equal to 64. i < 120 so you do another loop. Unfortunately you pass the end of the array before you reach 128 and you enter into undefined behaviour territory. This could manifest as an access violation (0xC0000005) which would cause you to crash.

Now take the example of nLength=128 which should run perfectly in your optimised loop with my suggested modification. First loop i is fine and i = 64. i is less than 65 so another loop is performed. i is now equal to 128 and loop exits. The outer loop also doesn't run because i == nLength. Job done :)

like image 152
Goz Avatar answered Sep 28 '22 02:09

Goz


As requested, here's what I had in mind when I said "4 accumulators in the loop and summing them after the loop".

__int64 Sum (const unsigned char* pData, const int& nLength)
{
    __int64 nSum (0);

    __m128i* pp = (__m128i*)pData;


    __m128i zero = _mm_setzero_si128(),
        a = _mm_setzero_si128(),
        b = _mm_setzero_si128(),
        c = _mm_setzero_si128(),
        d = _mm_setzero_si128(), tmp;

    int i (0);

    for ( ; i < (nLength - 63); i+=64)
    {
        a = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), a );
        b = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), b );
        c = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), c );
        d = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), d );
    }

    tmp = _mm_add_epi64( _mm_add_epi64( a, b ), _mm_add_epi64( c, d ));
    tmp = _mm_add_epi64( _mm_srli_si128( tmp, 8 ), tmp );
    nSum = (_mm_cvtsi128_si32( tmp ) & 0xFFFFFFFFULL) + 
               (((__int64)_mm_cvtsi128_si32( _mm_srli_si128( tmp, 4 ) )) << 32);

    // ... the rest
    for (; i < nLength; i++)
        nSum += pData [i];

    return nSum;
}
like image 23
harold Avatar answered Sep 28 '22 03:09

harold