
Saturated subtraction - AVX or SSE4.2

I am improving the performance of a C program and I can't get a better execution time for the most "expensive" loop.

I have to subtract 1 from each element of an unsigned long int array, if the element is greater than zero.

The loop is:

unsigned long int * WorkerDataTime;
...
for (WorkerID=0;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

And I try this:

for (WorkerID=0;WorkerID<WorkersON;++WorkerID){
    int rest = WorkerDataTime[WorkerID] > 0;
    WorkerDataTime[WorkerID] = WorkerDataTime[WorkerID] - rest;
}

But the execution time is similar.

THE QUESTION: Is there any intrinsic instruction (SSE4.2, AVX...) to do this directly? (I'm using gcc 4.8.2)

I know that it is possible with char or short elements (_mm_subs_epu8 and _mm_subs_epu16 for unsigned saturation), and I can't use AVX2.

Thank you.

asked Nov 03 '14 by Cristian Morales

People also ask

What is AVX and SSE?

SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) are SIMD (single instruction, multiple data) instruction sets supported by recent Intel and AMD CPUs. SIMD programming exploits data-level parallelism: a single instruction operates on multiple data elements at once within one core.

What is the difference between AVX and AVX2?

For floating-point code, the main difference between AVX and AVX2 is the availability of the new FMA instructions; both AVX and AVX2 have 256-bit FP registers. The main advantage of AVX2 is for integer code/data types, where you can expect up to a 2x speedup; for FP code, the roughly 8% gain of AVX2 over AVX is still a worthwhile speedup.


2 Answers

With SSE4.1 it is possible using three instructions. Here is code that processes an entire array, decrementing all unsigned integers that aren't zero:

void clampedDecrement_SSE (__m128i * data, size_t count)
{
  // processes 2 elements per iteration; no alignment checks are done.
  // count must be a multiple of 2.

  size_t i;
  count /= 2;

  __m128i zero = _mm_set1_epi32(0);
  __m128i ones = _mm_set1_epi32(~0);

  for (i=0; i<count; i++)
  {
    __m128i values, mask;

    // load 2 64 bit integers:
    values = _mm_load_si128 (data);

    // compare against zero. Gives either 0 or ~0 (on match)
    mask   = _mm_cmpeq_epi64 (values, zero);

    // negate above mask. Yields -1 for all non zero elements, 0 otherwise:
    mask   = _mm_xor_si128(mask, ones);

    // now just add the mask for saturated unsigned decrement operation:
    values = _mm_add_epi64(values, mask);

    // and store the result back to memory:
    _mm_store_si128(data, values);
    data++;
  }
}
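In practice the count is rarely a multiple of 2 and the data may be unaligned, so here is a minimal sketch of the same mask trick with unaligned loads and a scalar tail. The wrapper name clamped_decrement_u64 is made up for this sketch; _mm_cmpeq_epi64 needs SSE4.1, enabled below with a GCC/Clang target attribute so no -msse4.1 flag is required at compile time (the CPU must still support it at run time):

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* hypothetical wrapper: same compare/xor/add sequence as
   clampedDecrement_SSE above, but with unaligned loads and a
   scalar loop for a leftover element when n is odd */
__attribute__((target("sse4.1")))
void clamped_decrement_u64(uint64_t *a, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v    = _mm_loadu_si128((__m128i *)&a[i]);
        __m128i mask = _mm_cmpeq_epi64(v, _mm_setzero_si128()); /* ~0 where v == 0 */
        mask = _mm_xor_si128(mask, _mm_set1_epi32(~0));         /* -1 where v != 0 */
        v    = _mm_add_epi64(v, mask);                          /* v - 1 for nonzero lanes */
        _mm_storeu_si128((__m128i *)&a[i], v);
    }
    for (; i < n; ++i)                                          /* scalar tail */
        if (a[i] > 0) a[i] -= 1;
}
```

Compiled with plain gcc -O2, an input of {0, 1, 5, 0, 3} becomes {0, 0, 4, 0, 2}: zeros stay clamped at zero and everything else drops by one.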

With AVX2 we can improve on this and process 4 elements at a time:

void clampedDecrement (__m256i * data, size_t count)
{
  // processes 4 elements per iteration; no alignment checks are done.
  // count must be a multiple of 4.

  size_t i;
  count /= 4;

  // we need some constants:
  __m256i zero = _mm256_set1_epi32(0);
  __m256i ones = _mm256_set1_epi32(~0);

  for (i=0; i<count; i++)
  {
    __m256i values, mask;

    // load 4 64 bit integers:
    values = _mm256_load_si256 (data);

    // compare against zero. Gives either 0 or ~0 (on match)
    mask   = _mm256_cmpeq_epi64 (values, zero);

    // negate above mask. Yields -1 for all non zero elements, 0 otherwise:
    mask   = _mm256_xor_si256(mask, ones);

    // now just add the mask for saturated unsigned decrement operation:
    values = _mm256_add_epi64(values, mask);

    // and store the result back to memory:
    _mm256_store_si256(data, values);
    data++;
  }
}

EDIT: added SSE code version.

answered Oct 18 '22 by Nils Pipenbrinck


Unless your CPU has XOP, there is no single-instruction way to compare unsigned 64-bit integers.

I ripped the following from Agner Fog's Vector Class Library. This shows how to compare unsigned 64-bit integers.

static inline Vec2qb operator > (Vec2uq const & a, Vec2uq const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return Vec2q(_mm_comgt_epu64(a,b));
#else  // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);          // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(a,sign32);             // a with sign bits flipped
    __m128i bflip   = _mm_xor_si128(b,sign32);             // b with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(a,b);                // a == b, dwords
    __m128i bigger  = _mm_cmpgt_epi32(aflip,bflip);        // a > b, dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0);      // a > b, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal,biggerl);        // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger,eqbig);          // high part bigger or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig,0xF5);       // result copied to low part
    return  Vec2qb(Vec2q(big));
#endif
}
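To see the dword-level trick in isolation, the Vec2uq operator above can be rewritten with plain intrinsics, without the VCL wrapper types. cmpgt_epu64_sse2 is a made-up name for this sketch; only SSE2 is required:

```c
#include <stdint.h>
#include <emmintrin.h>

/* hypothetical helper: a > b per unsigned 64-bit lane, SSE2 only,
   following the same dword sequence as the VCL operator above */
static __m128i cmpgt_epu64_sse2(__m128i a, __m128i b)
{
    __m128i sign32  = _mm_set1_epi32((int)0x80000000);     // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(a, sign32);            // a with dword sign bits flipped
    __m128i bflip   = _mm_xor_si128(b, sign32);            // b with dword sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(a, b);               // a == b, per dword
    __m128i bigger  = _mm_cmpgt_epi32(aflip, bflip);       // a > b, per dword (unsigned via flip)
    __m128i biggerl = _mm_shuffle_epi32(bigger, 0xA0);     // low-dword results copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal, biggerl);       // high part equal AND low part bigger
    __m128i hibig   = _mm_or_si128(bigger, eqbig);         // high part bigger OR (equal and low bigger)
    return _mm_shuffle_epi32(hibig, 0xF5);                 // broadcast high-dword verdict to whole lane
}
```

For example, comparing {5, 0x8000000000000001} against {5, 1} yields {0, ~0}: the equal lanes give 0, and the lane whose high bit is set compares greater only because the sign flip makes the signed dword compare behave as unsigned.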

So if your CPU supports XOP, you should try compiling with -mxop and see if the loop is vectorized.

Edit: If GCC does not vectorize this like you want and your CPU has XOP you can do

for (WorkerID=0; WorkerID<WorkersON-1; WorkerID+=2){
    __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i cmp = _mm_comgt_epu64(v, _mm_setzero_si128());
    v = _mm_add_epi64(v, cmp);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

Compile with -mxop and add #include <x86intrin.h>.

Edit: as Nils Pipenbrinck pointed out, if you don't have XOP you can do this with one more instruction using _mm_xor_si128:

for (WorkerID=0; WorkerID<WorkersON-1; WorkerID+=2){
    __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i mask = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    mask = _mm_xor_si128(mask, _mm_set1_epi32(~0));
    v = _mm_add_epi64(v, mask);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (;WorkerID<WorkersON;++WorkerID){
    if(WorkerDataTime[WorkerID] > 0) WorkerDataTime[WorkerID]-=1;
}

Edit: Based on a comment by Stephen Canon I learned that there is a more efficient way to compare general 64-bit unsigned integers using the pcmpgtq instruction from SSE4.2:

__m128i a,b;
__m128i sign64 = _mm_set1_epi64x((long long)0x8000000000000000ULL);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
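Putting that comparison to work on the original loop, here is a hedged sketch of the full SSE4.2 saturated decrement. The name clamped_decrement_sse42 is made up here; pcmpgtq (_mm_cmpgt_epi64) requires SSE4.2, enabled below with a target attribute:

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* hypothetical helper: decrement every nonzero element, SSE4.2 path.
   v > 0 (unsigned) is computed as (v ^ sign) >s (0 ^ sign) with pcmpgtq,
   giving -1 exactly in the nonzero lanes; adding that mask performs
   the clamped decrement. */
__attribute__((target("sse4.2")))
void clamped_decrement_sse42(uint64_t *a, size_t n)
{
    __m128i sign64 = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v    = _mm_loadu_si128((__m128i *)&a[i]);
        __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(v, sign64), sign64);
        v = _mm_add_epi64(v, mask);          /* v - 1 where v was nonzero */
        _mm_storeu_si128((__m128i *)&a[i], v);
    }
    for (; i < n; ++i)                       /* scalar tail for odd counts */
        if (a[i] > 0) a[i] -= 1;
}
```

Compared with the _mm_cmpeq_epi64/_mm_xor_si128 pair above, this trades the xor of the mask for the xor of the input, so it is still three vector operations per iteration; the two sign-flip constants fold into one register.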
answered Oct 18 '22 by Z boson