I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (the original alpha channel is kept).
for (int i = 0, max = width * height * 4; i < max; i += 4) {
    data[i]   = static_cast<uint16_t>(data[i]   * data[i+3]) / 255;
    data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255;
    data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255;
}
You will find my current implementation below, but I think it could be much faster and I'm wasting precious CPU cycles. I tested it with quick-bench.com and the results are encouraging, but what should I change to make it blazing fast?
Thanks
-------- UPDATED 09/06/2019 --------
Based on @chtz's and @Peter Cordes's comments, I put together a repository to assess the different solutions; here are the results. Do you think it can be better?
Run on (8 X 3100 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 8388K (x1)
Load Average: 1.24, 1.60, 1.68
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
v1_plain_mean 1189884 ns 1189573 ns 1000 itr=840.865/s
v1_plain_median 1184059 ns 1183786 ns 1000 itr=844.747/s
v1_plain_stddev 20575 ns 20166 ns 1000 itr=13.4227/s
v1_simd_x86_mean 297866 ns 297784 ns 1000 itr=3.3616k/s
v1_simd_x86_median 294995 ns 294927 ns 1000 itr=3.39067k/s
v1_simd_x86_stddev 9863 ns 9794 ns 1000 itr=105.51/s
Thanks Dot and Beached (discord #include)
v2_plain_mean 323541 ns 323451 ns 1000 itr=3.09678k/s
v2_plain_median 318932 ns 318855 ns 1000 itr=3.13623k/s
v2_plain_stddev 13598 ns 13542 ns 1000 itr=122.588/s
Thanks Peter Cordes (stackoverflow)
v3_simd_x86_mean 264323 ns 264247 ns 1000 itr=3.79233k/s
v3_simd_x86_median 260641 ns 260560 ns 1000 itr=3.83788k/s
v3_simd_x86_stddev 12466 ns 12422 ns 1000 itr=170.36/s
Thanks chtz (stackoverflow)
v4_simd_x86_mean 266174 ns 266109 ns 1000 itr=3.76502k/s
v4_simd_x86_median 262940 ns 262916 ns 1000 itr=3.8035k/s
v4_simd_x86_stddev 11993 ns 11962 ns 1000 itr=159.906/s
-------- UPDATED 10/06/2019 --------
I added the AVX2 version and used chtz's tip: using 255 for the alpha value in color_odd, I was able to remove _mm_blendv_epi8 and improve the benchmark.
Thanks Peter and chtz
v3_simd_x86_mean 246171 ns 246107 ns 100 itr=4.06517k/s
v3_simd_x86_median 245191 ns 245167 ns 100 itr=4.07885k/s
v3_simd_x86_stddev 5423 ns 5406 ns 100 itr=87.13/s
// AVX2
v5_simd_x86_mean 158456 ns 158409 ns 100 itr=6.31411k/s
v5_simd_x86_median 158248 ns 158165 ns 100 itr=6.3225k/s
v5_simd_x86_stddev 2340 ns 2329 ns 100 itr=92.1406/s
If you can use SSSE3, _mm_shuffle_epi8 lets you create the __m128i alpha vector, instead of AND/shift/OR. pshufb will zero bytes where the high bit of the shuffle-control vector element is set. (Shuffle throughput is easily a bottleneck on Intel Haswell and later, so using immediate shifts or AND is still good for the other operations where you can get it done with one instruction.)
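For instance (a minimal sketch of my own, not code from the answer or the question), assuming RGBA byte order with the alpha in byte 3 of each 32-bit pixel, one pshufb builds the 16-bit alpha vector that pairs with _mm_unpacklo_epi8(px, zero):

#include <tmmintrin.h>  // SSSE3

// Hypothetical helper: produce A0 A0 A0 A0 A1 A1 A1 A1 as 16-bit lanes from an
// RGBA register. Control bytes of -1 have their high bit set, so pshufb writes
// zeros into those destination bytes.
static inline __m128i alpha_lo_epi16(__m128i px)
{
    const __m128i ctrl = _mm_setr_epi8(3, -1, 3, -1, 3, -1, 3, -1,
                                       7, -1, 7, -1, 7, -1, 7, -1);
    return _mm_shuffle_epi8(px, ctrl);
}

An analogous control vector with byte indices 11 and 15 produces the alpha vector for the high half of the register.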
On Skylake and later, it's probably a win to use SSE4.1 pblendvb to merge alpha instead of AND/ANDN/OR. (On Haswell, the 2 uops of pblendvb can only run on port 5. That might actually be ok, because there are enough other uops that this won't create a shuffle bottleneck.)
On Skylake, non-VEX pblendvb is a single-uop instruction that runs on any port. (The VEX version is 2 uops for any port, so it's still strictly better than AND/ANDN/OR, but not as good as the SSE version. The SSE version uses an implicit XMM0 input, though, so it costs an extra movdqa instruction unless your loop only ever uses pblendvb with the same blend mask. Or, if you unroll, it can maybe amortize that movdqa to set XMM0.)
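As an illustration (my sketch, again assuming the alpha byte sits in byte 3 of each 32-bit pixel), merging the untouched alpha bytes back into the premultiplied result is a single blend:

#include <smmintrin.h>  // SSE4.1

// Sketch only: keep the original alpha byte of each pixel and take the
// premultiplied RGB bytes from 'scaled'. _mm_blendv_epi8 selects from 'orig'
// wherever the mask byte has its high bit set, i.e. in the alpha positions.
static inline __m128i merge_alpha(__m128i scaled, __m128i orig)
{
    const __m128i keep_alpha = _mm_set1_epi32((int)0xFF000000);
    return _mm_blendv_epi8(scaled, orig, keep_alpha);
}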
Also, _mm_srli_epi16 by 7 and _mm_slli_epi16(color_odd, 8) could be just a single shift, with maybe an AND. Or a pblendvb avoids the need to clear garbage like you do before an OR.
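To make the shift-combining concrete (a generic sketch, not the question's color_odd code): for 16-bit lanes, ((x >> 7) << 8) equals (x << 1) & 0xFF00, so one shift plus one AND replaces the two shifts. The answer further down uses exactly this identity.

#include <emmintrin.h>

// Generic illustration: 'x' stands for whatever value was going to be shifted
// right by 7 and then left by 8 within each 16-bit lane.
static inline __m128i srl7_sll8_combined(__m128i x)
{
    return _mm_and_si128(_mm_slli_epi16(x, 1), _mm_set1_epi16((short)0xFF00));
}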
I'm not sure if you could use _mm_mulhrs_epi16 to mul-and-shift, but probably not. It isn't the right shift, and the +1 for "rounding" isn't what you want.
Obviously an AVX2 version of this could do twice as much work per instruction, giving a factor 2 speedup on Haswell / Skylake for the main loop. Probably somewhat neutral on Ryzen, where 256b instructions decode into 2 uops. (Or more for lane-crossing shuffles, but you don't have those.)
The worst case cleanup would have to run more times, but this should still be negligible.
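For reference, here is a minimal self-contained AVX2 sketch of this general idea (my code, not the answer's and not the v5 version from the benchmark repository). It assumes RGBA byte order with alpha in byte 3, a pixel count that is a multiple of 8, the 0x8081 division trick, and a final byte blend to restore the original alpha:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void premultiply_avx2(uint8_t* data, size_t pixel_count)
{
    const __m256i zero  = _mm256_setzero_si256();
    const __m256i inv   = _mm256_set1_epi16((short)0x8081);    // ceil((1 << 23) / 255)
    const __m256i amask = _mm256_set1_epi32((int)0xFF000000);  // alpha byte positions
    for (size_t i = 0; i < pixel_count * 4; i += 32) {          // 8 pixels per iteration
        __m256i px = _mm256_loadu_si256((const __m256i*)(data + i));
        // Widen to 16 bits: 'lo' holds pixels 0,1 and 4,5; 'hi' holds 2,3 and 6,7.
        __m256i lo = _mm256_unpacklo_epi8(px, zero);
        __m256i hi = _mm256_unpackhi_epi8(px, zero);
        // Broadcast each pixel's alpha word across its four channel words.
        __m256i alo = _mm256_shufflehi_epi16(
            _mm256_shufflelo_epi16(lo, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
        __m256i ahi = _mm256_shufflehi_epi16(
            _mm256_shufflelo_epi16(hi, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
        // color * alpha, then exact /255 via the 0x8081 trick.
        lo = _mm256_srli_epi16(_mm256_mulhi_epu16(_mm256_mullo_epi16(lo, alo), inv), 7);
        hi = _mm256_srli_epi16(_mm256_mulhi_epu16(_mm256_mullo_epi16(hi, ahi), inv), 7);
        // Narrow back to bytes (lane-local, so pixel order is preserved) and
        // restore the untouched alpha bytes from the original load.
        __m256i out = _mm256_blendv_epi8(_mm256_packus_epi16(lo, hi), px, amask);
        _mm256_storeu_si256((__m256i*)(data + i), out);
    }
}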
I was playing around a bit with this. I think the best solution is to split the input from two registers into channels of 16-bit integers (i.e., 8-bit integers interleaved with 0x00). Then do the actual scaling (taking only 6 multiplications + 3 shifts for 8 pixels, instead of 8+4 in your original approach), and then re-join the channels into pixels.
Proof of concept (untested), assuming the input is aligned and the number of pixels is a multiple of 8. Version 2.0 (see history for the previous version):
void alpha_premultiply(__m128i *input, int length)
{
    for (__m128i* last = input + (length & ~1); input != last; input += 2)
    {
        // load data and split channels:
        __m128i abgr = _mm_load_si128(input);
        __m128i ABGR = _mm_load_si128(input+1);
        __m128i __ab = _mm_srli_epi32(abgr, 16);
        __m128i GR__ = _mm_slli_epi32(ABGR, 16);
        __m128i ABab = _mm_blend_epi16(ABGR, __ab, 0x55);
        __m128i GRgr = _mm_blend_epi16(GR__, abgr, 0x55);
        __m128i A_a_ = _mm_and_si128(ABab, _mm_set1_epi16(0xFF00));
        __m128i G_g_ = _mm_and_si128(GRgr, _mm_set1_epi16(0xFF00));
        __m128i R_r_ = _mm_slli_epi16(GRgr, 8);
        __m128i B_b_ = _mm_slli_epi16(ABab, 8);

        // actual alpha-scaling:
        __m128i inv = _mm_set1_epi16(0x8081); // = ceil((1<<(16+7))/255.0)
        G_g_ = _mm_mulhi_epu16(_mm_mulhi_epu16(G_g_, A_a_), inv);
        // shift 7 to the right and 8 to the left, or shift 1 to the left and mask:
        G_g_ = _mm_and_si128(_mm_add_epi16(G_g_, G_g_), _mm_set1_epi16(0xFF00));
        __m128i _R_r = _mm_mulhi_epu16(_mm_mulhi_epu16(R_r_, A_a_), inv);
        _R_r = _mm_srli_epi16(_R_r, 7);
        __m128i _B_b = _mm_mulhi_epu16(_mm_mulhi_epu16(B_b_, A_a_), inv);
        _B_b = _mm_srli_epi16(_B_b, 7);

        // re-assemble channels:
        GRgr = _mm_or_si128(_R_r, G_g_);
        ABab = _mm_or_si128(A_a_, _B_b);
        __m128i __GR = _mm_srli_epi32(GRgr, 16);
        __m128i ab__ = _mm_slli_epi32(ABab, 16);
        ABGR = _mm_blend_epi16(ABab, __GR, 0x55);
        abgr = _mm_blend_epi16(ab__, GRgr, 0x55);

        // store result
        _mm_store_si128(input, abgr);
        _mm_store_si128(input+1, ABGR);
    }
}
Variable names use _ to mark a 0, and the lowest-address byte is on the right (to be less confusing with shift and bit operations). Each register will hold 4 sequential pixels, or 4+4 interleaved channels. Lower and uppercase letters are from different input locations. (Godbolt: https://godbolt.org/z/OcxAfJ)
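A hypothetical call site (my reading of the code: length is counted in units of __m128i, i.e. pixels / 4, and the loop rounds it down to an even count), assuming a tightly packed, 16-byte-aligned RGBA buffer and the alpha_premultiply definition above in scope:

#include <emmintrin.h>
#include <cstdint>

// Assumes 'data' is 16-byte aligned and width * height is a multiple of 8 pixels.
void premultiply_image(uint8_t* data, int width, int height)
{
    alpha_premultiply(reinterpret_cast<__m128i*>(data), width * height / 4);
}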
On Haswell (or earlier), this would bottleneck on port 0 (shifts and multiplications), but with SSSE3 you could replace all the shifts by 8 and 16 with _mm_alignr_epi8. And it would be better to leave _R_r and _B_b at the lower bytes (this uses a pand instead of a psllw, but requires shifting A_a_ to _A_a). Possible pitfall: clang replaces _mm_alignr_epi8 with corresponding shift instructions: https://godbolt.org/z/BhEZoV (maybe there are flags to prohibit clang from replacing these; GCC uses palignr: https://godbolt.org/z/lu-jNQ).
On Skylake this might be optimal as it is (except for porting to AVX2, of course). There are 8 shifts, 6 multiplications and 1 addition, i.e. 15 operations on ports 0 and 1. Furthermore, 4 blends on port 5, and 5 and/or operations (4 on p5 and another on either p0 or p1), i.e., 8 uops per port for 8 pixels (or 16 pixels with AVX2).
Code should be very easy to port to AVX2 (and using AVX1 alone will save some register copies). Finally, to make the code SSE2 compatible, only the blend instructions need to be replaced by corresponding and+or operations.
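For example (a sketch of that SSE2 fallback, not part of the answer): _mm_blend_epi16(a, b, 0x55) picks the even 16-bit lanes from b and the odd ones from a, which SSE2 can express with a fixed mask:

#include <emmintrin.h>

// SSE2 stand-in for _mm_blend_epi16(a, b, 0x55): even words from 'b',
// odd words from 'a'.
static inline __m128i blend_even_words_sse2(__m128i a, __m128i b)
{
    const __m128i even = _mm_set1_epi32(0x0000FFFF);
    return _mm_or_si128(_mm_and_si128(b, even), _mm_andnot_si128(even, a));
}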