I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?
// unsigned char *new_image is loaded with data
...
float buffer[8];
buffer[x ] = new_image[x];
buffer[x + 1] = new_image[x + 1];
buffer[x + 2] = new_image[x + 2];
buffer[x + 3] = new_image[x + 3];
buffer[x + 4] = new_image[x + 4];
buffer[x + 5] = new_image[x + 5];
buffer[x + 6] = new_image[x + 6];
buffer[x + 7] = new_image[x + 7];
// buffer is then used for further operations
...
//What I want instead in pseudocode:
__m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];
If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32-bit integers in a 256b register. From there, conversion to float can happen in-place.
; rsi = new_image
VPMOVZXBD ymm0, [rsi] ; or SX to sign-extend (Byte to DWord)
VCVTDQ2PS ymm0, ymm0 ; convert to packed foat
This is a good strategy even if you want to do this for multiple vectors, but even better might be a 128-bit broadcast load to feed vpmovzxbd ymm,xmm
and vpshufb ymm
(_mm256_shuffle_epi8
) for the high 64 bits, because Intel SnB-family CPUs don't micro-fuse a vpmovzx ymm,mem
, only only vpmovzx xmm,mem
. (https://agner.org/optimize/). Broadcast loads are single uop with no ALU port required, running purely in a load port. So this is 3 total uops to bcast-load + vpmovzx + vpshufb.
(TODO: write an intrinsics version of that. It also sidesteps the problem of missed optimizations for _mm_loadl_epi64
-> _mm256_cvtepu8_epi32
.)
Of course this requires a shuffle control vector in another register, so it's only worth it if you can use that multiple times.
vpshufb
is usable because the data needed for each lane is there from the broadcast, and the high bit of the shuffle-control will zero the corresponding element.
This broadcast + shuffle strategy might be good on Ryzen; Agner Fog doesn't list uop counts for vpmovsx/zx ymm
on it.
Do not do something like a 128-bit or 256-bit load and then shuffle that to feed further vpmovzx
instructions. Total shuffle throughput will probably already be a bottleneck because vpmovzx
is a shuffle. Intel Haswell/Skylake (the most common AVX2 uarches) have 1-per-clock shuffles but 2-per-clock loads. Using extra shuffle instructions instead of folding separate memory operands into vpmovzxbd
is terrible. Only if you can reduce total uop count like I suggested with broadcast-load + vpmovzxbd + vpshufb is it a win.
My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant for converting back to uint8_t
. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb
, because they work in-lane, unlike vpmovzx
.
With only AVX1, not AVX2, you should do:
VPMOVZXBD xmm0, [rsi]
VPMOVZXBD xmm1, [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1 ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS ymm0, ymm0 ; convert to packed float. Yes, works without AVX2
You of course never need an array of float, just __m256
vectors.
GCC / MSVC missed optimizations for VPMOVZXBD ymm,[mem]
with intrinsics
GCC and MSVC are bad at folding a _mm_loadl_epi64
into a memory operand for vpmovzx*
. (But at least there is a load intrinsic of the right width, unlike for pmovzxbq xmm, word [mem]
.)
We get a vmovq
load and then a separate vpmovzx
with an XMM input. (With ICC and clang3.6+ we get safe + optimal code from using _mm_loadl_epi64
, like from gcc9+)
But gcc8.3 and earlier can fold a _mm_loadu_si128
16-byte load intrinsic into an 8-byte memory operand. This gives optimal asm at -O3
on GCC, but is unsafe at -O0
where it compiles to an actual vmovdqu
load that touches more data that we actually load, and could go off the end of a page.
Two gcc bugs submitted because of this answer:
MOVQ m64, %xmm
in 32bit mode. (TODO: report this for clang/LLVM as well?)There's no intrinsic to use SSE4.1 pmovsx
/ pmovzx
as a load, only with a __m128i
source operand. But the asm instructions only read the amount of data they actually use, not a 16-byte __m128i
memory source operand. Unlike punpck*
, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).
So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__
is Bad, making it possible to create bugs that only happen in the debug build or only in the optimized build!
#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif
__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef USE_MOVQ // compiles to an actual movq then movzx ymm, xmm with gcc8.3 -O3
__m128i small_load = _mm_loadl_epi64( (const __m128i*)p);
#else // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
__m128i small_load = _mm_loadu_si128( (const __m128i*)p );
#endif
__m256i intvec = _mm256_cvtepu8_epi32( small_load );
//__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p ); // compiles to an aligned load with -O0
return _mm256_cvtepi32_ps(intvec);
}
With USE_MOVQ enabled, gcc -O3
(v5.3.0) emits. (So does MSVC)
load_bytes_to_m256(unsigned char*):
vmovq xmm0, QWORD PTR [rdi]
vpmovzxbd ymm0, xmm0
vcvtdq2ps ymm0, ymm0
ret
The stupid vmovq
is what we want to avoid. If you let it use the unsafe loadu_si128
version, it will make good optimized code.
GCC9, clang, and ICC emit:
load_bytes_to_m256(unsigned char*):
vpmovzxbd ymm0, qword ptr [rdi] # ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
vcvtdq2ps ymm0, ymm0
ret
Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128
to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.
Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:
movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E
(or VEX.128.66.0F.W1 6E
) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64-bit integer, _mm_cvtsi64_si128
. (Some compilers don't define it in 32-bit mode.)
movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode F3 0F 7E
(VEX.128.F3.0F.WIG 7E
) for MOVQ xmm, xmm/m64)
.
The asm ISA ref manual only lists the m128i _mm_mov_epi64(__m128i a)
intrinsic for zeroing the high 64b of a vector while copying it. But the intrinsics guide does list _mm_loadl_epi64(__m128i const* mem_addr)
which has a stupid prototype (pointer to a 16-byte __m128i
type when it really only loads 8 bytes). It is available on all 4 of the major x86 compilers, and should actually be safe. Note that the __m128i*
is just passed to this opaque intrinsic, not actually dereferenced.
The more sane _mm_loadu_si64 (void const* mem_addr)
is also listed, but gcc is missing that one.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With