I'm learning to use SIMD capabilities by rewriting my personal image processing library using vector intrinsics. One basic function is a simple "array +=", i.e.
void arrayAdd(unsigned char* A, unsigned char* B, size_t n) {
    for (size_t i = 0; i < n; i++) { B[i] += A[i]; }
}
For arbitrary array lengths, the obvious SIMD code (assuming aligned by 16) is something like:
size_t i = 0;
__m128i xmm0, xmm1;
size_t n16 = n - (n % 16);   // largest multiple of 16 not exceeding n
for (; i < n16; i += 16) {
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    xmm1 = _mm_add_epi8( xmm0, xmm1 );
    _mm_store_si128( (__m128i*) (B + i), xmm1 );
}
for (; i < n; i++) { B[i] += A[i]; }   // scalar epilogue for the remaining n % 16 bytes
But is it possible to do all the additions with SIMD instructions? I thought of trying this:
__m128i mask = (0x100<<8*(n - n16))-1;
_mm_maskmoveu_si128( xmm1, mask, (__m128i*) (B + i) );
for the extra elements, but will that result in undefined behavior? The mask should guarantee no access is actually made past the array bounds (I think). The alternative is to do the extra elements first, but then the array needs to be aligned by n - n16, which doesn't seem right.
Is there another, more optimal pattern for such vectorized loops?
One option is to pad your array to a multiple of 16 bytes. Then you can do 128 bit load/add/store and simply ignore the results following the point you care about.
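For example, a minimal sketch of that padding approach (assuming you control the allocation; _mm_malloc/_mm_free are one way to get 16-byte-aligned storage):
size_t nPadded = (n + 15) & ~(size_t)15;              // round n up to a multiple of 16
unsigned char* A = (unsigned char*) _mm_malloc(nPadded, 16);
unsigned char* B = (unsigned char*) _mm_malloc(nPadded, 16);
// ... fill the first n bytes; the trailing pad bytes are don't-cares ...
// run the 16-byte loop over nPadded bytes, no scalar epilogue needed
// _mm_free(A); _mm_free(B);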
For large arrays, though, the overhead of the byte-by-byte "epilog" is going to be very small. Unrolling the loop may improve performance more, something like:
__m128i xmm2, xmm3;
size_t n32 = n - (n % 32);   // largest multiple of 32 not exceeding n
for (; i < n32; i += 32) {
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    xmm2 = _mm_load_si128( (__m128i*) (A + i + 16) );
    xmm3 = _mm_load_si128( (__m128i*) (B + i + 16) );
    xmm1 = _mm_add_epi8( xmm0, xmm1 );
    xmm3 = _mm_add_epi8( xmm2, xmm3 );
    _mm_store_si128( (__m128i*) (B + i), xmm1 );
    _mm_store_si128( (__m128i*) (B + i + 16), xmm3 );
}
// Do another 128-bit load/add/store here if required
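For instance, that cleanup could reuse the 16-byte body from your original loop, followed by the scalar tail (a sketch, assuming i, xmm0, xmm1 are still in scope):
if (n - i >= 16) {   // one more aligned 16-byte chunk fits
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    _mm_store_si128( (__m128i*) (B + i), _mm_add_epi8( xmm0, xmm1 ) );
    i += 16;
}
for (; i < n; i++) { B[i] += A[i]; }   // scalar tail for the last n % 16 bytes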
But it's hard to say without doing some profiling.
You could also do an unaligned load/store at the end (assuming you have more than 16 bytes), though this will probably not make a big difference. E.g. if you have 20 bytes you do one aligned load/add/store at offset 0 and another unaligned load/add/store (_mm_loadu_si128, _mm_storeu_si128) at offset 4.
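For an in-place += the overlapped bytes must not be added twice, so one way to structure it is to compute the unaligned tail chunk from the original data up front, run the aligned loop, and store the tail last; a rough sketch, assuming n >= 16:
__m128i tail = _mm_add_epi8( _mm_loadu_si128( (__m128i*) (A + n - 16) ),
                             _mm_loadu_si128( (__m128i*) (B + n - 16) ) );   // last 16 bytes, unaligned, from original data
size_t i, n16 = n - (n % 16);
for (i = 0; i < n16; i += 16) {
    __m128i a = _mm_load_si128( (__m128i*) (A + i) );
    __m128i b = _mm_load_si128( (__m128i*) (B + i) );
    _mm_store_si128( (__m128i*) (B + i), _mm_add_epi8( a, b ) );
}
_mm_storeu_si128( (__m128i*) (B + n - 16), tail );   // rewriting the overlap is harmless: same values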
You could use _mm_maskmoveu_si128, but you need to get the mask into an XMM register, and your sample code isn't going to work. You want a mask whose first n % 16 bytes are all FF's; since _mm_srli_si128 only takes a compile-time constant shift count, one way to build it for a runtime remainder is to compare a byte-index vector against n % 16. At the end of the day, it will probably be slower than the unaligned load/add/store.
This would be something like:
__m128i idx  = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
__m128i mask = _mm_cmplt_epi8( idx, _mm_set1_epi8( (char) (n % 16) ) );   // 0xFF where index < n % 16
_mm_maskmoveu_si128( xmm1, mask, (char*) (B + i) );   // xmm1 = sum for the tail chunk; only the masked bytes are stored