I wrote SSE code for summing up byte values. (VS2005.)
As it is simple enough it works quite well (and fast). Only there are crashes with some sizes of the array. And it crashes only in release mode - in debug never. Maybe someone sees the "obvious" bug? Any help appreciated.
__int64 Sum (const unsigned char* pData, const unsigned int& nLength)
{
__int64 nSum (0);
__m128i* pp = (__m128i*)pData;
ATLASSERT( ( (DWORD)pp & 15 ) == 0 ); // pointer must point to address multiple of 16 (cache line)
__m128i zero = _mm_setzero_si128(),
a, b, c, d, tmp;
unsigned int i (0);
for ( ; i < nLength; i+=64) // 4-fach loop-unroll (x 16)
{
a = _mm_sad_epu8( *(pp++), zero);
b = _mm_sad_epu8( *(pp++), zero); // It crashes here.
c = _mm_sad_epu8( *(pp++), zero);
d = _mm_sad_epu8( *(pp++), zero);
// commenting the following line prevents the crash (???)
tmp = _mm_add_epi64( _mm_add_epi64( _mm_add_epi64( a, b ), c ), d);
a = _mm_srli_si128 ( tmp, 8 );
nSum += _mm_cvtsi128_si32( a ) + _mm_cvtsi128_si32( tmp );
}
// ... the rest
if (nLength % 64)
for (i -= 64; i < nLength; i++)
nSum += pData [i];
return nSum;
}
The function is called like this:
unsigned int nLength = 3571653; // One of the values that causes crash
unsigned char *pData = (unsigned char*) _aligned_malloc(nLength, 16);
Sum (pData, nLength);
Your for loop needs to be defined as follows:
for ( ; i < (nLength - 63); i+=64)
Basically imagine you pass in an array with nLength 120. You are fine on the first run through. i is now equal to 64. i < 120 so you do another loop. Unfortunately you pass the end of the array before you reach 128 and you enter into undefined behaviour territory. This could manifest as an access violation (0xC0000005) which would cause you to crash.
Now take the example of nLength=128 which should run perfectly in your optimised loop with my suggested modification. First loop i is fine and i = 64. i is less than 65 so another loop is performed. i is now equal to 128 and loop exits. The outer loop also doesn't run because i == nLength. Job done :)
As requested, here's what I had in mind when I said "4 accumulators in the loop and summing them after the loop".
__int64 Sum (const unsigned char* pData, const int& nLength)
{
__int64 nSum (0);
__m128i* pp = (__m128i*)pData;
__m128i zero = _mm_setzero_si128(),
a = _mm_setzero_si128(),
b = _mm_setzero_si128(),
c = _mm_setzero_si128(),
d = _mm_setzero_si128(), tmp;
int i (0);
for ( ; i < (nLength - 63); i+=64)
{
a = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), a );
b = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), b );
c = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), c );
d = _mm_add_epi64( _mm_sad_epu8( *(pp++), zero ), d );
}
tmp = _mm_add_epi64( _mm_add_epi64( a, b ), _mm_add_epi64( c, d ));
tmp = _mm_add_epi64( _mm_srli_si128( tmp, 8 ), tmp );
nSum = (_mm_cvtsi128_si32( tmp ) & 0xFFFFFFFFULL) +
(((__int64)_mm_cvtsi128_si32( _mm_srli_si128( tmp, 4 ) )) << 32);
// ... the rest
for (; i < nLength; i++)
nSum += pData [i];
return nSum;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With