Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize a cycle?

I have the following bottleneck function.

typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;
     for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2 ) ? b1 : b2;
    }
}

I want to replace C++ code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8 but it used signed compare. I need unsigned compare.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: I do not want to use multi-threading in this case.

like image 791
Alexey Malistov Avatar asked Oct 21 '10 11:10

Alexey Malistov


1 Answers

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

  • use _mm_min_epu8 to get the unsigned min of p1, p2
  • compare this min for equality with p2 using _mm_cmpeq_epi8
  • the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
  • you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

like image 97
Paul R Avatar answered Sep 20 '22 06:09

Paul R