I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring). The code below filters each pixel of an 8-bit grayscale bitmap with a simple [1 2 1] weighting in the horizontal direction, processing 16 pixels at a time.
What seems very bad about this code, at least to me, is the amount of insert/extract in it, which is not very elegant and probably slows everything down as well. Is there a better way to carry data from one register into another when shifting?
buf is the image data, 16-byte aligned. w and h are the width and height, both multiples of 16.
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

__m128i *p = (__m128i *) buf;
__m128i cur1, cur2, sum1, sum2, zeros, tmp1, tmp2, saved;
zeros = _mm_setzero_si128();
short shifted, last = 0, next;
int x;
// preload first 16-pixel block
cur1 = _mm_load_si128(p);
for (x = 1; x < (w * h) / 16; x++) {
    // unpack bytes to 16-bit words
    sum1 = sum2 = saved = cur1;
    sum1 = _mm_unpacklo_epi8(sum1, zeros);
    sum2 = _mm_unpackhi_epi8(sum2, zeros);
    cur1 = tmp1 = sum1;
    cur2 = tmp2 = sum2;
    // "middle" pixel (weight 2)
    sum1 = _mm_add_epi16(sum1, sum1);
    sum2 = _mm_add_epi16(sum2, sum2);
    // left pixel
    cur2 = _mm_slli_si128(cur2, 2);
    shifted = _mm_extract_epi16(cur1, 7);
    cur2 = _mm_insert_epi16(cur2, shifted, 0);
    cur1 = _mm_slli_si128(cur1, 2);
    cur1 = _mm_insert_epi16(cur1, last, 0);
    sum1 = _mm_add_epi16(sum1, cur1);
    sum2 = _mm_add_epi16(sum2, cur2);
    // right pixel
    tmp1 = _mm_srli_si128(tmp1, 2);
    shifted = _mm_extract_epi16(tmp2, 0);
    tmp1 = _mm_insert_epi16(tmp1, shifted, 7);
    tmp2 = _mm_srli_si128(tmp2, 2);
    // preload next block
    cur1 = _mm_load_si128(p + x);
    // we need the first pixel of the next block for the "right" pixel
    next = _mm_extract_epi16(cur1, 0) & 0xff;
    tmp2 = _mm_insert_epi16(tmp2, next, 7);
    // and the last pixel of the current block for the next "left" pixel
    last = ((uint16_t) _mm_extract_epi16(saved, 7)) >> 8;
    sum1 = _mm_add_epi16(sum1, tmp1);
    sum2 = _mm_add_epi16(sum2, tmp2);
    // divide by 4
    sum1 = _mm_srli_epi16(sum1, 2);
    sum2 = _mm_srli_epi16(sum2, 2);
    sum1 = _mm_packus_epi16(sum1, sum2);
    _mm_store_si128(p + x - 1, sum1);
}
I suggest keeping the neighbouring pixels in SSE registers. That is, keep the results of the _mm_slli_si128 / _mm_srli_si128 in SSE variables, and eliminate all of the inserts and extracts. My reasoning is that on older CPUs the insert/extract instructions require communication between the SSE units and the general-purpose units, which is much slower than keeping the calculation within SSE, even if it spills over to the L1 cache.
When that is done, there should be only four whole-register shifts per iteration ( _mm_slli_si128, _mm_srli_si128, not counting the division shifts ). My suggestion is then to benchmark your code, because by that point it may already have hit the memory bandwidth limit, which means you can't optimize any further.
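A rough sketch of what I mean, untested and reusing p, w and h from your code. prev/cur/next hold three consecutive 16-byte blocks, and the neighbour vectors are assembled entirely inside the SSE registers:
__m128i zeros = _mm_setzero_si128();
__m128i prev = zeros;                    // no block to the left of block 0
__m128i cur = _mm_load_si128(p);
int x;
for (x = 0; x < (w * h) / 16; x++) {
    // last block: treat the missing right neighbour as zero
    __m128i next = (x + 1 < (w * h) / 16) ? _mm_load_si128(p + x + 1) : zeros;
    // left[i] = pixel[i-1], right[i] = pixel[i+1], built without insert/extract
    __m128i left  = _mm_or_si128(_mm_slli_si128(cur, 1), _mm_srli_si128(prev, 15));
    __m128i right = _mm_or_si128(_mm_srli_si128(cur, 1), _mm_slli_si128(next, 15));
    // widen to 16 bits, form (left + 2*middle + right) / 4, repack
    __m128i lo = _mm_srli_epi16(_mm_add_epi16(
        _mm_add_epi16(_mm_unpacklo_epi8(left, zeros), _mm_unpacklo_epi8(right, zeros)),
        _mm_slli_epi16(_mm_unpacklo_epi8(cur, zeros), 1)), 2);
    __m128i hi = _mm_srli_epi16(_mm_add_epi16(
        _mm_add_epi16(_mm_unpackhi_epi8(left, zeros), _mm_unpackhi_epi8(right, zeros)),
        _mm_slli_epi16(_mm_unpackhi_epi8(cur, zeros), 1)), 2);
    // storing in place is safe: all source data is already in registers
    _mm_store_si128(p + x, _mm_packus_epi16(lo, hi));
    prev = cur;
    cur = next;
}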
If the image is large (bigger than the L2 cache) and the output won't be read back soon, try using MOVNTDQ ( _mm_stream_si128 ) for the write-back; it stores directly to memory without polluting the cache. MOVNTDQ is an SSE2 instruction.
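For example (a sketch; _mm_stream_si128 needs a 16-byte-aligned address, and the streaming stores should be fenced once you're done):
// inside the loop, instead of the plain store:
_mm_stream_si128(p + x - 1, sum1);  // MOVNTDQ: write directly to memory, no cache fill
// once, after the loop:
_mm_sfence();                       // make the streaming stores visible in order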
This kind of neighbourhood operation was always a pain with SSE, until SSSE3 (aka SSE3.5) came along and PALIGNR (_mm_alignr_epi8) was introduced.
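With three consecutive blocks prev/cur/next in registers (as in the other answer's loop), the shifted neighbour vectors are just two PALIGNRs — a sketch:
#include <tmmintrin.h>  // SSSE3 intrinsics
__m128i left  = _mm_alignr_epi8(cur, prev, 15);  // left[i]  = pixel[i-1]
__m128i right = _mm_alignr_epi8(next, cur, 1);   // right[i] = pixel[i+1]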
If you need backward compatibility with SSE2/SSE3 though, you can write an equivalent macro or inline function which emulates _mm_alignr_epi8 for SSE2/SSE3 and which drops through to _mm_alignr_epi8 when targeting SSE3.5/SSE4.
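A sketch of such a macro — ALIGNR_EPI8 is just a name made up here, and the shift count must be a compile-time constant in 1..15, same as for the real PALIGNR:
#ifdef __SSSE3__
#include <tmmintrin.h>
#define ALIGNR_EPI8(hi, lo, n)  _mm_alignr_epi8((hi), (lo), (n))
#else
// SSE2 fallback: low 128 bits of ((hi:lo) >> 8*n), built from two shifts and an OR
#define ALIGNR_EPI8(hi, lo, n) \
    _mm_or_si128(_mm_srli_si128((lo), (n)), _mm_slli_si128((hi), 16 - (n)))
#endif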
Another approach is to use misaligned loads to get the shifted data. This is relatively expensive on older CPUs (roughly twice the latency and half the throughput of aligned loads), but it may be acceptable depending on how much computation you're doing per load. It also has the benefit that on current Intel CPUs (Core i7) misaligned loads carry no penalty compared to aligned loads, so your code will be quite efficient on Core i7 et al.
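Sketched against the buffer from the question — note the loads at q - 1 and q + 1 read one byte before and after the 16-byte block, so this assumes at least one readable byte of padding at each end of the image buffer (an assumption the original code doesn't make):
unsigned char *q = (unsigned char *) buf + 16 * x;
__m128i left  = _mm_loadu_si128((__m128i *) (q - 1));  // pixel[i-1], misaligned load
__m128i mid   = _mm_load_si128((__m128i *) q);         // pixel[i],   aligned load
__m128i right = _mm_loadu_si128((__m128i *) (q + 1));  // pixel[i+1], misaligned load
// then widen, apply the [1 2 1] weighting and pack exactly as before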