SSE intrinsics - comparison if/else optimization

I have been trying to optimize some code which handles raw pixel data. The plain C++ implementation is currently too slow, so I've been trying to make some headway using SSE intrinsics (SSE/2/3, not SSE4) with MSVC 2008. Considering it's my first time working at this low a level, I've made good progress.

Unfortunately, I've come to a particular piece of code which has me stuck:

//Begin bad/suboptimal SSE code
__m128i vnMask  = _mm_set1_epi16(0x0001);
__m128i vn1     = _mm_and_si128(vnFloors, vnMask);

for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m++)
{
    bool bIsEvenFloor   = vn1.m128i_u16[m]==0;

    vnPxChroma.m128i_u16[m] = m%2==0
        ? (bIsEvenFloor ? vnPxCeilChroma.m128i_u16[m] : vnPxFloorChroma.m128i_u16[m])
        : (bIsEvenFloor ? vnPxFloorChroma.m128i_u16[m] : vnPxCeilChroma.m128i_u16[m]);
}

Currently, I'm falling back to a plain C++ implementation for this section, because I can't quite get my head around how it can be optimized using SSE; I find the SSE comparison intrinsics a bit tricky.

Any suggestions/tips would be much appreciated.

EDIT: The equivalent C++ code which handles a single pixel at a time would be:

short pxCl=0, pxFl=0;
short uv=0; // chroma component of pixel
short y=0;  // luma component of pixel

for(int i = 0; i < nEndOfLine; ++i) // nEndOfLine: pixels per line (placeholder name)
{
    //Initialize pxCl, and pxFL
    //...

    bool bIsEvenI       = (i%2)==0;
    bool bIsEvenFloor   = (m_pnDistancesFloor[i] % 2)==0;

    uv = !bIsEvenI
        ? (bIsEvenFloor ? pxCl : pxFl)
        : (bIsEvenFloor ? pxFl : pxCl);

    //Merge the Y/UV of the pixel;
    //...
}

Basically, I'm doing a nonlinear edge stretch from 4:3 to 16:9.

asked Jan 24 '12 by ZeroDefect


1 Answer

OK, so I don't know what this code is doing; however, I do know you are asking how to optimise away the ternary operators and get this portion of code operating only in SSE. As a first step, I would recommend trying an approach using integer flags and multiplication to avoid a conditional operator. For instance:

This section

for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m++)
{
    bool bIsEvenFloor   = vn1.m128i_u16[m]==0;      

    vnPxChroma.m128i_u16[m] = m%2==0 ?  
      (bIsEvenFloor ? vnPxCeilChroma.m128i_u16[m] : vnPxFloorChroma.m128i_u16[m])  : 
      (bIsEvenFloor ? vnPxFloorChroma.m128i_u16[m] : vnPxCeilChroma.m128i_u16[m]); 
}

Is semantically equivalent to this:

// DISCLAIMER: Untested both in compilation and execution

typedef unsigned short uint16; // MSVC 2008 has no <cstdint>

// Process all m%2==0 lanes in steps of 2
for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m+=2)
{
    // This line could surely pack multiple u16s into one SSE2 register
    uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1; // If u16[m] == 0, result is 0
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;      // Flip 1 to 0, 0 to 1

    // This line could surely perform an SSE2 multiply across multiple registers
    vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                              iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];
}

// Process all m%2!=0 lanes in steps of 2
for(int m=1; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m+=2)
{
    uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1; // If u16[m] == 0, result is 0
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;      // Flip 1 to 0, 0 to 1

    vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxFloorChroma.m128i_u16[m] +
                              iIsOddFloor  * vnPxCeilChroma.m128i_u16[m];
}

Basically, by splitting this into two loops you lose the performance benefit of serial memory access, but you drop a modulo operation and two conditional operators per element.
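If losing the serial access matters, a single-loop variant (untested, my own rearrangement, reusing the uint16 typedef from above) keeps the serial walk and still avoids the modulo and the ternaries:

// Untested: one loop, two lanes per iteration; the ceil/floor roles
// simply swap between the even lane and the odd lane.
for(int m = 0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m += 2)
{
    // Even lane (m%2 == 0)
    uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1;
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;
    vnPxChroma.m128i_u16[m]   = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                                iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];

    // Odd lane (m%2 != 0)
    iIsOddFloor  = vn1.m128i_u16[m+1] & 0x1;
    iIsEvenFloor = iIsOddFloor ^ 0x1;
    vnPxChroma.m128i_u16[m+1] = iIsEvenFloor * vnPxFloorChroma.m128i_u16[m+1] +
                                iIsOddFloor  * vnPxCeilChroma.m128i_u16[m+1];
}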

Now, you'll notice there are still two boolean operations per loop, as well as the multiplies, and none of these are SSE intrinsic implementations yet. What is stored in your vn1.m128i_u16[] array? Is it only zeros and ones? If so, you don't need the masking part and can do away with it. If not, can you normalize the data in this array to contain only zeros and ones? If the vn1.m128i_u16 array only contains ones and zeros, then this code becomes:

uint16 iIsOddFloor  = vn1.m128i_u16[m];
uint16 iIsEvenFloor = iIsOddFloor ^ 0x1; // Flip 1 to 0, 0 to 1

You will also notice that I'm not using SSE multiplies to perform the iIsEvenFloor * vnPx... part, nor SSE registers to store iIsEvenFloor and iIsOddFloor. I'm sorry, I can't remember the SSE intrinsics for a u16 multiply off the top of my head, but nevertheless I hope this approach is helpful. Some optimisations you should look into:

// This line could surely pack multiple u16s into one SSE2 register
uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1; // If u16[m] == 0, result is 0
uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;      // Flip 1 to 0, 0 to 1

// This line could surely perform an SSE2 multiply across multiple registers
vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                          iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];

In the section of code you've posted, and in my modification, we are still not making full use of the SSE1/2/3 intrinsics, but it might provide some pointers on how that could be done (i.e. how to vectorize the code).
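As a sketch of what the fully vectorized version might look like (untested; it assumes the vnPxFloorChroma/vnPxCeilChroma registers from your code, and the lane-parity constant vnOddLanes is my own addition), the whole per-lane select collapses into a handful of SSE2 operations:

// Untested sketch: per-lane 0/1 flags, selected by multiply-and-add (all SSE2).
// vn1 holds (floor & 1) in each 16-bit lane: 1 = odd floor, 0 = even floor.
__m128i vnOnes     = _mm_set1_epi16(1);
__m128i vnOddLanes = _mm_set_epi16(1, 0, 1, 0, 1, 0, 1, 0); // u16[0]=0, u16[1]=1, ...

// Take the floor chroma exactly when the lane parity and the floor parity
// differ; this reproduces the nested ternaries above.
__m128i vnTakeFloor = _mm_xor_si128(vn1, vnOddLanes);     // 1 = take floor chroma
__m128i vnTakeCeil  = _mm_xor_si128(vnTakeFloor, vnOnes); // 1 = take ceil chroma

// chroma = takeCeil*ceil + takeFloor*floor, per 16-bit lane
__m128i vnPxChroma = _mm_add_epi16(
    _mm_mullo_epi16(vnTakeCeil,  vnPxCeilChroma),
    _mm_mullo_epi16(vnTakeFloor, vnPxFloorChroma));

Since the flags are only ever 0 or 1, each product fits in the low 16 bits, so _mm_mullo_epi16 (not _mm_mulhi_epu16) is the multiply you want here.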

Finally, I would say: test everything. Run the above code unaltered and profile it, make your changes, and profile again. The actual performance numbers may surprise you!
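For the profiling itself, a minimal harness along these lines would do on Windows/MSVC (this sketch is my own addition, not part of the pixel code):

#include <windows.h>
#include <stdio.h>

int main()
{
    LARGE_INTEGER liFreq, liStart, liEnd;
    QueryPerformanceFrequency(&liFreq);

    QueryPerformanceCounter(&liStart);
    // ... run the pixel-processing loop over a representative frame here ...
    QueryPerformanceCounter(&liEnd);

    // Convert ticks to milliseconds
    printf("elapsed: %.3f ms\n",
           1000.0 * (liEnd.QuadPart - liStart.QuadPart) / liFreq.QuadPart);
    return 0;
}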


Update 1:

I've been through the Intel SIMD intrinsics documentation to pick out the relevant intrinsics that could be of use here. Specifically, take a look at the bitwise XOR, AND, and multiply/add operations:

__m128i data type
The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.

__m128i _mm_add_epi16(__m128i a, __m128i b)
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in b.

__m128i _mm_mulhi_epu16(__m128i a, __m128i b)
Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b. Packs the upper 16 bits of the 8 unsigned 32-bit results.

R0=hiword(a0 * b0)
R1=hiword(a1 * b1)
R2=hiword(a2 * b2)
R3=hiword(a3 * b3)
..
R7=hiword(a7 * b7)

__m128i _mm_mullo_epi16(__m128i a, __m128i b)
Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned 16-bit integers from b. Packs the lower 16 bits of the 8 signed or unsigned 32-bit results.

R0=loword(a0 * b0)
R1=loword(a1 * b1)
R2=loword(a2 * b2)
R3=loword(a3 * b3)
..
R7=loword(a7 * b7)

__m128i _mm_and_si128(__m128i a, __m128i b)
Performs a bitwise AND of the 128-bit value in a with the 128-bit value in b.

__m128i _mm_andnot_si128(__m128i a, __m128i b)
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.

__m128i _mm_xor_si128(__m128i a, __m128i b)
Performs a bitwise XOR of the 128-bit value in a with the 128-bit value in b.

Also, from your code example, for reference:

// Effect: u16[0] = u16[1] = ... = u16[7] = 0x1
__m128i vnMask = _mm_set1_epi16(0x0001); // Sets the 8 signed 16-bit integer values.

// Effect: vn1.u16[i] = vnFloors.u16[i] & 0x1, for each of the 8 lanes
__m128i vn1 = _mm_and_si128(vnFloors, vnMask); // Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
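Putting these together, a mask-based select (again untested; it uses two further SSE2 intrinsics not listed above, _mm_cmpeq_epi16 and _mm_or_si128) would avoid the multiplies entirely:

// Untested sketch: branchless select with a compare mask instead of multiplies.
__m128i vnOddLanes = _mm_set_epi16(1, 0, 1, 0, 1, 0, 1, 0); // lane parity: u16[0]=0, u16[1]=1, ...
__m128i vnSel      = _mm_xor_si128(vn1, vnOddLanes);        // per lane: 0 = take ceil, 1 = take floor

// 0xFFFF in the lanes where the ceil chroma is taken, 0x0000 elsewhere
__m128i vnMaskCeil = _mm_cmpeq_epi16(vnSel, _mm_setzero_si128());

__m128i vnPxChroma = _mm_or_si128(
    _mm_and_si128(vnMaskCeil, vnPxCeilChroma),       // ceil lanes
    _mm_andnot_si128(vnMaskCeil, vnPxFloorChroma));  // floor lanes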

answered Oct 21 '22 by Dr. Andrew Burnett-Thompson