
Fast copy every second byte to new memory area

I need a fast way to copy every second byte to a new malloc'd memory area. I have a raw image with RGB data and 16 bits per channel (48 bit) and want to create an RGB image with 8 bits per channel (24 bit).

Is there a faster method than copying bytewise? I don't know much about SSE2, but I suppose it's possible with SSE/SSE2.

asked Mar 07 '23 by akw

1 Answer

Your RGB data is packed, so we don't actually have to care about pixel boundaries. The problem is just packing every other byte of an array. (At least within each row of your image; if you use a row stride of 16 or 32B, the padding might not be a whole number of pixels.)

This can be done efficiently using SSE2, AVX, or AVX2 shuffles. (Also AVX512BW, and maybe even more with AVX512VBMI, but the first AVX512VBMI CPUs probably won't have a very efficient vpermt2b, a 2-input lane-crossing byte shuffle.)
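For what it's worth, here's a hypothetical AVX512VBMI sketch of that single-shuffle idea (the function name and index setup are mine, not from the answer): one vpermt2b packs the high byte of 64 uint16_t elements (128 source bytes) into one 64-byte store.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// AVX512VBMI: one 2-input lane-crossing byte shuffle (vpermt2b) per 64 output bytes.
// Assumes bytes (the output size) is a multiple of 64.
void pack_high8_vbmi(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  uint8_t idx_bytes[64];
  for (int i = 0; i < 64; i++)
     idx_bytes[i] = (uint8_t)(2*i + 1);          // odd bytes = high byte of each uint16_t
  __m512i idx = _mm512_loadu_si512(idx_bytes);   // bit 6 of each index selects the second source
  do {
     __m512i a = _mm512_loadu_si512(src);
     __m512i b = _mm512_loadu_si512(src + 64);
     _mm512_storeu_si512(dst, _mm512_permutex2var_epi8(a, idx, b));
     dst += 64;
     src += 128;
  } while (dst < end_dst);
}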


You can use SSSE3 pshufb to grab the bytes you want, but it's only a 1-input shuffle that gives you 8 bytes of output. Storing 8 bytes at a time takes more total store instructions than storing 16 bytes at a time. (You'd also bottleneck on shuffle throughput on Intel CPUs since Haswell, which only have one shuffle port and thus one-per-clock shuffle throughput.) (You could also consider 2x pshufb + por to feed a 16B store, and that could be good on Ryzen. Use 2 different shuffle control vectors, one that puts the result in the low 64b and one that puts the result in the high 64b. See Convert 8 16 bit SSE register to 8bit data.)
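For reference, a minimal sketch of that 2x pshufb + por packing step (the function name and shuffle constants are mine, not from the answer; assumes SSSE3):

#include <immintrin.h>

// a holds uint16_t elements 0..7, b holds elements 8..15 (little-endian, so the
// high byte of each element sits at the odd byte position).
static inline __m128i pack_high8_2xpshufb(__m128i a, __m128i b) {
    const __m128i shuf_lo = _mm_setr_epi8(1,3,5,7,9,11,13,15,        // high bytes of a -> low 64b
                                          -1,-1,-1,-1,-1,-1,-1,-1);  // zero the upper half
    const __m128i shuf_hi = _mm_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1,   // zero the lower half
                                          1,3,5,7,9,11,13,15);       // high bytes of b -> high 64b
    return _mm_or_si128(_mm_shuffle_epi8(a, shuf_lo), _mm_shuffle_epi8(b, shuf_hi));
}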

Instead, it's probably a win to use _mm_packus_epi16 (packuswb). But since it saturates instead of discarding bytes you don't want, you have to feed it input with the data you want to keep in the low byte of each 16-bit element.

In your case, that's probably the high byte of each RGB16 component, discarding the 8 least-significant bits from each color component, i.e. _mm_srli_epi16(v, 8). If you want the low byte instead (truncation), zero the high byte in each 16-bit element with _mm_and_si128(v, _mm_set1_epi16(0x00ff)). (In that case, never mind all the stuff about using an unaligned load to replace one of the shifts; that's the easy case and you should just use two ANDs to feed a PACKUS.)
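As an aside, a minimal sketch of that easy low-byte (truncation) case, assuming SSE2 and a size that's a multiple of 32 source bytes (the function name is mine):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Keep the LOW byte of each uint16_t: two ANDs feeding one PACKUS, no shifts.
void pack_low8_truncate(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  const __m128i mask = _mm_set1_epi16(0x00FF);
  do {
     __m128i v0 = _mm_and_si128(_mm_loadu_si128((const __m128i*)src), mask);
     __m128i v1 = _mm_and_si128(_mm_loadu_si128((const __m128i*)src + 1), mask);
     _mm_storeu_si128((__m128i*)dst, _mm_packus_epi16(v0, v1));
     dst += 16;
     src += 16;   // 16 uint16_t = 32 source bytes
  } while (dst < end_dst);
}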

Shift-then-PACKUS is more or less how gcc and clang auto-vectorize this at -O3. Except they both screw up and waste significant instructions (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82356, https://bugs.llvm.org/show_bug.cgi?id=34773). Still, letting them auto-vectorize with SSE2 (baseline for x86-64), or with NEON for ARM or whatever, is a good safe way to get some performance without risk of introducing bugs while manually vectorizing. Outside of compiler bugs, anything they generate will correctly implement the C semantics of this code, which works for any size and alignment:

#include <stdint.h>
#include <stddef.h>

// gcc and clang both auto-vectorize this sub-optimally with SSE2.
// clang is *really* sub-optimal with AVX2, gcc no worse
void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

See the code + asm for this and later versions on Godbolt.

#include <immintrin.h>   // SSE2 intrinsics

// Compilers auto-vectorize sort of like this, but with different
// silly missed optimizations.
// This is a sort of reasonable SSE2 baseline with no manual unrolling.
void pack_high8(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  // TODO: handle non-multiple-of-16 sizes
  uint8_t *end_dst = dst + bytes;
  do{
     __m128i v0 = _mm_loadu_si128((__m128i*)src);
     __m128i v1 = _mm_loadu_si128(((__m128i*)src)+1);
     v0 = _mm_srli_epi16(v0, 8);
     v1 = _mm_srli_epi16(v1, 8);
     __m128i pack = _mm_packus_epi16(v0, v1);
     _mm_storeu_si128((__m128i*)dst, pack);
     dst += 16;
     src += 16;  // 32 bytes, unsigned short
  } while(dst < end_dst);
}

But vector shift throughput is limited to 1 per clock in many microarchitectures (Intel before Skylake, AMD Bulldozer/Ryzen). Also, there's no load+shift asm instruction until AVX512, so it's hard to get all these operations through the pipeline. (i.e. we easily bottleneck on the front-end.)

Instead of shifting, we can load from an address offset by one byte, so the bytes we want land in the low byte of each 16-bit element, and then AND to keep just those bytes. AND has good throughput, especially with AVX, where the compiler can fold the load+and into one instruction. If the input is 32-byte aligned and we only use this offset-load trick for one vector of each pair (the other still uses a shift), our loads never cross a cache-line boundary. With loop unrolling, this is probably the best bet for SSE2 or AVX (without AVX2) across many CPUs.

// take both args as uint8_t* so we can offset a load by 1 byte to replace a shift with an AND
// if src is 32B-aligned, we never have cache-line splits
void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     // load starting 1 byte in, so the high byte of each uint16_t is already in
     // the low byte of each 16-bit element; AND replaces the shift for this half
     __m128i v0_offset = _mm_loadu_si128((__m128i*)(src+1));
     __m128i v0 = _mm_and_si128(v0_offset, _mm_set1_epi16(0x00FF));
     __m128i v1 = _mm_loadu_si128((__m128i*)(src+16));
     v1 = _mm_srli_epi16(v1, 8);
     __m128i pack = _mm_packus_epi16(v0, v1);
     _mm_storeu_si128((__m128i*)dst, pack);
     dst += 16;
     src += 32;  // 32 source bytes -> 16 output bytes
  } while(dst < end_dst);
}

Without AVX, the inner loop takes 6 instructions (6 uops) per 16B vector of results. (With AVX it's only 5, since the load folds into the and). Since this totally bottlenecks on the front-end, loop unrolling helps a lot. gcc -O3 -funroll-loops looks pretty good for this manually-vectorized version, especially with gcc -O3 -funroll-loops -march=sandybridge to enable AVX.

With AVX, it might be worth doing both v0 and v1 with an AND, to reduce the front-end bottleneck at the cost of cache-line splits (and occasional page splits). But maybe not, depending on the uarch and on whether your data is already misaligned. (Branching on that could be worth it, since you need to max out cache bandwidth if the data is hot in L1D.)
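A hypothetical sketch of that both-ANDs variant (the +1/+17 offsets and the function name are mine); with AVX, both loads can fold into vpand:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Both halves via offset load + AND, no shifts. The second offset load (src+17)
// can cross a cache line even when src is 32B-aligned.
void pack_high8_two_ands(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  const __m128i mask = _mm_set1_epi16(0x00FF);
  do {
     __m128i v0 = _mm_and_si128(_mm_loadu_si128((const __m128i*)(src + 1)),  mask);
     __m128i v1 = _mm_and_si128(_mm_loadu_si128((const __m128i*)(src + 17)), mask);
     _mm_storeu_si128((__m128i*)dst, _mm_packus_epi16(v0, v1));
     dst += 16;
     src += 32;
  } while (dst < end_dst);
}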

With AVX2, a 256b version of this with 256b loads should work well on Haswell/Skylake. With src 64B-aligned, the offset load will still never cache-line split. (The offset load will always cover bytes [32:1] of a cache line, and the plain load feeding the shift will cover bytes [63:32].) But packs work within 128b lanes, so after the pack you have to shuffle (with vpermq) to put the 64-bit chunks into the right order. Look at how gcc auto-vectorizes the scalar baseline version with vpackuswb ymm7, ymm5, ymm6 / vpermq ymm8, ymm7, 0xD8.
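A hypothetical AVX2 sketch of that 256b version (the function name is mine; assumes the size is a multiple of 64 source bytes):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// 256b loads, one shift replaced by an offset load + AND, then an in-lane pack
// and a vpermq to put the 64-bit chunks back in order.
void pack_high8_avx2(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  const __m256i mask = _mm256_set1_epi16(0x00FF);
  do {
     __m256i v0 = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(src + 1)), mask);
     __m256i v1 = _mm256_srli_epi16(_mm256_loadu_si256((const __m256i*)(src + 32)), 8);
     __m256i packed = _mm256_packus_epi16(v0, v1);        // works within 128b lanes
     packed = _mm256_permute4x64_epi64(packed, 0xD8);     // restore 64-bit chunk order
     _mm256_storeu_si256((__m256i*)dst, packed);
     dst += 32;
     src += 64;
  } while (dst < end_dst);
}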

With AVX512F, this trick stops working because a 64B load has to be aligned to stay within a single 64B cache line. But with AVX512, different shuffles are available, and ALU uop throughput is more precious (on Skylake-AVX512, where port1 shuts down while 512b uops are in flight). So v = load+shift -> __m256i packed = _mm512_cvtepi16_epi8(v) might work well, even though it only does 256b stores.
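A hypothetical AVX512BW sketch of that load+shift -> vpmovwb idea (the function name is mine; assumes the output size is a multiple of 32 bytes):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// 512b load + shift, then vpmovwb truncates each 16-bit element to 8 bits,
// giving one 256b store per 64 source bytes.
void pack_high8_avx512bw(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do {
     __m512i v = _mm512_srli_epi16(_mm512_loadu_si512(src), 8);     // high byte -> low byte
     _mm256_storeu_si256((__m256i*)dst, _mm512_cvtepi16_epi8(v));   // vpmovwb
     dst += 32;
     src += 64;
  } while (dst < end_dst);
}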

The right choice probably depends on whether your src and dst are usually 64B aligned. KNL doesn't have AVX512BW, so this probably only applies to Skylake-AVX512 anyway.

answered Mar 16 '23 by Peter Cordes