I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (the original alpha channel is kept).
for (int i = 0, max = width * height * 4; i < max; i += 4) {
    data[i]   = static_cast<uint16_t>(data[i]   * data[i+3]) / 255;
    data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255;
    data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255;
}
You will find my current implementation below, but I think it could be much faster and I'm wasting precious CPU cycles. I tested it with quick-bench.com and the results are encouraging, but what should I change to make it blazing fast?
Thanks
-------- UPDATED 09/06/2019 --------
Based on @chtz's and @Peter Cordes's comments, I put together a repository to assess the different solutions; here are the results. Do you think it can be better?
Run on (8 X 3100 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 8388K (x1)
Load Average: 1.24, 1.60, 1.68
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
v1_plain_mean 1189884 ns 1189573 ns 1000 itr=840.865/s
v1_plain_median 1184059 ns 1183786 ns 1000 itr=844.747/s
v1_plain_stddev 20575 ns 20166 ns 1000 itr=13.4227/s
v1_simd_x86_mean 297866 ns 297784 ns 1000 itr=3.3616k/s
v1_simd_x86_median 294995 ns 294927 ns 1000 itr=3.39067k/s
v1_simd_x86_stddev 9863 ns 9794 ns 1000 itr=105.51/s
Thanks Dot and Beached (discord #include)
v2_plain_mean 323541 ns 323451 ns 1000 itr=3.09678k/s
v2_plain_median 318932 ns 318855 ns 1000 itr=3.13623k/s
v2_plain_stddev 13598 ns 13542 ns 1000 itr=122.588/s
Thanks Peter Cordes (stackoverflow)
v3_simd_x86_mean 264323 ns 264247 ns 1000 itr=3.79233k/s
v3_simd_x86_median 260641 ns 260560 ns 1000 itr=3.83788k/s
v3_simd_x86_stddev 12466 ns 12422 ns 1000 itr=170.36/s
Thanks chtz (stackoverflow)
v4_simd_x86_mean 266174 ns 266109 ns 1000 itr=3.76502k/s
v4_simd_x86_median 262940 ns 262916 ns 1000 itr=3.8035k/s
v4_simd_x86_stddev 11993 ns 11962 ns 1000 itr=159.906/s
-------- UPDATED 10/06/2019 --------
I added the AVX2 version and used chtz's tip: using 255 for the alpha value in color_odd, I was able to remove _mm_blendv_epi8 and improve the benchmark.
Thanks Peter and chtz
v3_simd_x86_mean 246171 ns 246107 ns 100 itr=4.06517k/s
v3_simd_x86_median 245191 ns 245167 ns 100 itr=4.07885k/s
v3_simd_x86_stddev 5423 ns 5406 ns 100 itr=87.13/s
// AVX2
v5_simd_x86_mean 158456 ns 158409 ns 100 itr=6.31411k/s
v5_simd_x86_median 158248 ns 158165 ns 100 itr=6.3225k/s
v5_simd_x86_stddev 2340 ns 2329 ns 100 itr=92.1406/s
If you can use SSSE3, _mm_shuffle_epi8 lets you create the __m128i alpha vector, instead of AND/shift/OR. pshufb will zero bytes where the high bit of the shuffle-control vector element is set. (Shuffle throughput is easily a bottleneck on Intel Haswell and later, so using immediate shifts or AND is still good for the other operations where you can get it done with one instruction.)
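For instance (a minimal sketch of my own, not code from the answer or the question), assuming RGBA byte order with the alpha in byte 3 of each 32-bit pixel, one pshufb builds the 16-bit alpha vector that pairs with _mm_unpacklo_epi8(px, zero):

#include <tmmintrin.h>  // SSSE3

// Hypothetical helper: produce A0 A0 A0 A0 A1 A1 A1 A1 as 16-bit lanes from an
// RGBA register. Control bytes of -1 have their high bit set, so pshufb writes
// zeros into those destination bytes.
static inline __m128i alpha_lo_epi16(__m128i px)
{
    const __m128i ctrl = _mm_setr_epi8(3, -1, 3, -1, 3, -1, 3, -1,
                                       7, -1, 7, -1, 7, -1, 7, -1);
    return _mm_shuffle_epi8(px, ctrl);
}

An analogous control vector with byte indices 11 and 15 produces the alpha vector for the high half of the register.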
On Skylake and later, it's probably a win to use SSE4.1 pblendvb to merge alpha instead of AND/ANDN/OR. (On Haswell, the 2 uops of pblendvb can only run on port 5. That might actually be ok, because there are enough other uops that this won't create a shuffle bottleneck.)
On Skylake, non-VEX pblendvb is a single-uop instruction that runs on any port. (The VEX version is 2 uops for any port, so it's still strictly better than AND/ANDN/OR, but not as good as the SSE version. The SSE version uses an implicit XMM0 input, though, so it costs an extra movdqa instruction unless your loop only ever uses pblendvb with the same blend mask. Or, if you unroll, it can maybe amortize that movdqa to set XMM0.)
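As an illustration (my sketch, again assuming the alpha byte sits in byte 3 of each 32-bit pixel), merging the untouched alpha bytes back into the premultiplied result is a single blend:

#include <smmintrin.h>  // SSE4.1

// Sketch only: keep the original alpha byte of each pixel and take the
// premultiplied RGB bytes from 'scaled'. _mm_blendv_epi8 selects from 'orig'
// wherever the mask byte has its high bit set, i.e. in the alpha positions.
static inline __m128i merge_alpha(__m128i scaled, __m128i orig)
{
    const __m128i keep_alpha = _mm_set1_epi32((int)0xFF000000);
    return _mm_blendv_epi8(scaled, orig, keep_alpha);
}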
Also, _mm_srli_epi16 by 7 and _mm_slli_epi16(color_odd, 8) could be just a single shift, with maybe an AND. Or a pblendvb avoids the need to clear garbage like you do before an OR.
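To make the shift-combining concrete (a generic sketch, not the question's color_odd code): for 16-bit lanes, ((x >> 7) << 8) equals (x << 1) & 0xFF00, so one shift plus one AND replaces the two shifts. The answer further down uses exactly this identity.

#include <emmintrin.h>

// Generic illustration: 'x' stands for whatever value was going to be shifted
// right by 7 and then left by 8 within each 16-bit lane.
static inline __m128i srl7_sll8_combined(__m128i x)
{
    return _mm_and_si128(_mm_slli_epi16(x, 1), _mm_set1_epi16((short)0xFF00));
}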
I'm not sure if you could use _mm_mulhrs_epi16 to mul-and-shift, but probably not. It isn't the right shift, and the +1 for "rounding" isn't what you want.
Obviously an AVX2 version of this could do twice as much work per instruction, giving a factor 2 speedup on Haswell / Skylake for the main loop. Probably somewhat neutral on Ryzen, where 256b instructions decode into 2 uops. (Or more for lane-crossing shuffles, but you don't have those.)
The worst case cleanup would have to run more times, but this should still be negligible.
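For reference, here is a minimal self-contained AVX2 sketch of this general idea (my code, not the answer's and not the v5 version from the benchmark repository). It assumes RGBA byte order with alpha in byte 3, a pixel count that is a multiple of 8, the 0x8081 division trick, and a final byte blend to restore the original alpha:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void premultiply_avx2(uint8_t* data, size_t pixel_count)
{
    const __m256i zero  = _mm256_setzero_si256();
    const __m256i inv   = _mm256_set1_epi16((short)0x8081);    // ceil((1 << 23) / 255)
    const __m256i amask = _mm256_set1_epi32((int)0xFF000000);  // alpha byte positions
    for (size_t i = 0; i < pixel_count * 4; i += 32) {          // 8 pixels per iteration
        __m256i px = _mm256_loadu_si256((const __m256i*)(data + i));
        // Widen to 16 bits: 'lo' holds pixels 0,1 and 4,5; 'hi' holds 2,3 and 6,7.
        __m256i lo = _mm256_unpacklo_epi8(px, zero);
        __m256i hi = _mm256_unpackhi_epi8(px, zero);
        // Broadcast each pixel's alpha word across its four channel words.
        __m256i alo = _mm256_shufflehi_epi16(
            _mm256_shufflelo_epi16(lo, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
        __m256i ahi = _mm256_shufflehi_epi16(
            _mm256_shufflelo_epi16(hi, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
        // color * alpha, then exact /255 via the 0x8081 trick.
        lo = _mm256_srli_epi16(_mm256_mulhi_epu16(_mm256_mullo_epi16(lo, alo), inv), 7);
        hi = _mm256_srli_epi16(_mm256_mulhi_epu16(_mm256_mullo_epi16(hi, ahi), inv), 7);
        // Narrow back to bytes (lane-local, so pixel order is preserved) and
        // restore the untouched alpha bytes from the original load.
        __m256i out = _mm256_blendv_epi8(_mm256_packus_epi16(lo, hi), px, amask);
        _mm256_storeu_si256((__m256i*)(data + i), out);
    }
}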
I was playing around a bit with this. I think the best solution is to split the input from two registers into channels of 16-bit integers (i.e., 8-bit integers interleaved with 0x00). Then do the actual scaling (taking only 6 multiplications + 3 shifts for 8 pixels, instead of 8+4 in your original approach), and then re-join the channels into pixels.
Proof of concept (untested), assuming the input is aligned and the number of pixels is a multiple of 8. Version 2.0 (see history for the previous version):
void alpha_premultiply(__m128i *input, int length)
{
    for (__m128i* last = input + (length & ~1); input != last; input += 2)
    {
        // load data and split channels:
        __m128i abgr = _mm_load_si128(input);
        __m128i ABGR = _mm_load_si128(input+1);
        __m128i __ab = _mm_srli_epi32(abgr, 16);
        __m128i GR__ = _mm_slli_epi32(ABGR, 16);
        __m128i ABab = _mm_blend_epi16(ABGR, __ab, 0x55);
        __m128i GRgr = _mm_blend_epi16(GR__, abgr, 0x55);
        __m128i A_a_ = _mm_and_si128(ABab, _mm_set1_epi16(0xFF00));
        __m128i G_g_ = _mm_and_si128(GRgr, _mm_set1_epi16(0xFF00));
        __m128i R_r_ = _mm_slli_epi16(GRgr, 8);
        __m128i B_b_ = _mm_slli_epi16(ABab, 8);

        // actual alpha-scaling:
        __m128i inv = _mm_set1_epi16(0x8081); // = ceil((1<<(16+7))/255.0)
        G_g_ = _mm_mulhi_epu16(_mm_mulhi_epu16(G_g_, A_a_), inv);
        // shift 7 to the right and 8 to the left, or shift 1 to the left and mask:
        G_g_ = _mm_and_si128(_mm_add_epi16(G_g_, G_g_), _mm_set1_epi16(0xFF00));
        __m128i _R_r = _mm_mulhi_epu16(_mm_mulhi_epu16(R_r_, A_a_), inv);
        _R_r = _mm_srli_epi16(_R_r, 7);
        __m128i _B_b = _mm_mulhi_epu16(_mm_mulhi_epu16(B_b_, A_a_), inv);
        _B_b = _mm_srli_epi16(_B_b, 7);

        // re-assemble channels:
        GRgr = _mm_or_si128(_R_r, G_g_);
        ABab = _mm_or_si128(A_a_, _B_b);
        __m128i __GR = _mm_srli_epi32(GRgr, 16);
        __m128i ab__ = _mm_slli_epi32(ABab, 16);
        ABGR = _mm_blend_epi16(ABab, __GR, 0x55);
        abgr = _mm_blend_epi16(ab__, GRgr, 0x55);

        // store result
        _mm_store_si128(input, abgr);
        _mm_store_si128(input+1, ABGR);
    }
}
Variable names use _ to mark a 0, and the lowest-address byte is on the right (to be less confusing with shift and bit operations). Each register will hold 4 sequential pixels, or 4+4 interleaved channels. Lower and uppercase letters are from different input locations. (Godbolt: https://godbolt.org/z/OcxAfJ)
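A hypothetical call site (my reading of the code: length is counted in units of __m128i, i.e. pixels / 4, and the loop rounds it down to an even count), assuming a tightly packed, 16-byte-aligned RGBA buffer and the alpha_premultiply definition above in scope:

#include <emmintrin.h>
#include <cstdint>

// Assumes 'data' is 16-byte aligned and width * height is a multiple of 8 pixels.
void premultiply_image(uint8_t* data, int width, int height)
{
    alpha_premultiply(reinterpret_cast<__m128i*>(data), width * height / 4);
}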
On Haswell (or earlier), this would bottleneck on port 0 (shifts and multiplications), but with SSSE3 you could replace all the shifts by 8 and 16 with _mm_alignr_epi8. And it would be better to leave _R_r and _B_b at the lower bytes (this uses a pand instead of a psllw, but requires shifting A_a_ to _A_a). Possible pitfall: clang replaces _mm_alignr_epi8 with corresponding shift instructions: https://godbolt.org/z/BhEZoV (maybe there are flags to prohibit clang from replacing these; GCC uses palignr: https://godbolt.org/z/lu-jNQ).
On Skylake this might be optimal as it is (except for porting to AVX2, of course). There are 8 shifts, 6 multiplications and 1 addition, i.e. 15 operations on ports 0 and 1. Furthermore, 4 blends on port 5, and 5 and/or operations (4 on p5 and another on either p0 or p1), i.e., 8 uops per port for 8 pixels (or 16 pixels with AVX2).
Code should be very easy to port to AVX2 (and using AVX1 alone will save some register copies). Finally, to make the code SSE2 compatible, only the blend instructions need to be replaced by corresponding and+or operations.
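For example (a sketch of that SSE2 fallback, not part of the answer): _mm_blend_epi16(a, b, 0x55) picks the even 16-bit lanes from b and the odd ones from a, which SSE2 can express with a fixed mask:

#include <emmintrin.h>

// SSE2 stand-in for _mm_blend_epi16(a, b, 0x55): even words from 'b',
// odd words from 'a'.
static inline __m128i blend_even_words_sse2(__m128i a, __m128i b)
{
    const __m128i even = _mm_set1_epi32(0x0000FFFF);
    return _mm_or_si128(_mm_and_si128(b, even), _mm_andnot_si128(even, a));
}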