I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.
One area I'd like to improve is a very hot loop:
__declspec(noinline) int pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
    for (int i = 0; i < 16; ++i)
    {
        auto someIndex = arr[i];
        output[i] = someGlobalArray[someIndex];
    }
    for (int i = 0; i < 16; ++i)
    {
        if (output[i] == 1)
        {
            return i;
        }
    }
    return -1;
}
But of course, all three major compilers (MSVC, gcc, clang) refuse to vectorize this. I can sort of understand why, but I wanted to get confirmation.
If I had to vectorize this by hand, it would be:
(1) Vector-load "arr": this brings 16 4-byte integers into, say, zmm0
(2) 16 memory loads: from the address pointed to by zmm0[0..3] into zmm1[0..3], from the address pointed to by zmm0[4..7] into zmm1[4..7], and so on
(3) compare zmm0 and zmm1
(4) vector popcnt on the result to find the most significant set bit, and basically divide that by 8 to get the index that matched
First of all, can vector instructions do these things? Like, can they do this "gathering" operation, i.e. loads from the addresses held in zmm0?
Here is what clang generates:
0000000000400530 <_Z5superPiS_S_>:
400530: 48 63 07 movslq (%rdi),%rax
400533: 8b 04 86 mov (%rsi,%rax,4),%eax
400536: 89 02 mov %eax,(%rdx)
400538: 48 63 47 04 movslq 0x4(%rdi),%rax
40053c: 8b 04 86 mov (%rsi,%rax,4),%eax
40053f: 89 42 04 mov %eax,0x4(%rdx)
400542: 48 63 47 08 movslq 0x8(%rdi),%rax
400546: 8b 04 86 mov (%rsi,%rax,4),%eax
400549: 89 42 08 mov %eax,0x8(%rdx)
40054c: 48 63 47 0c movslq 0xc(%rdi),%rax
400550: 8b 04 86 mov (%rsi,%rax,4),%eax
400553: 89 42 0c mov %eax,0xc(%rdx)
400556: 48 63 47 10 movslq 0x10(%rdi),%rax
40055a: 8b 04 86 mov (%rsi,%rax,4),%eax
40055d: 89 42 10 mov %eax,0x10(%rdx)
400560: 48 63 47 14 movslq 0x14(%rdi),%rax
400564: 8b 04 86 mov (%rsi,%rax,4),%eax
400567: 89 42 14 mov %eax,0x14(%rdx)
40056a: 48 63 47 18 movslq 0x18(%rdi),%rax
40056e: 8b 04 86 mov (%rsi,%rax,4),%eax
400571: 89 42 18 mov %eax,0x18(%rdx)
400574: 48 63 47 1c movslq 0x1c(%rdi),%rax
400578: 8b 04 86 mov (%rsi,%rax,4),%eax
40057b: 89 42 1c mov %eax,0x1c(%rdx)
40057e: 48 63 47 20 movslq 0x20(%rdi),%rax
400582: 8b 04 86 mov (%rsi,%rax,4),%eax
400585: 89 42 20 mov %eax,0x20(%rdx)
400588: 48 63 47 24 movslq 0x24(%rdi),%rax
40058c: 8b 04 86 mov (%rsi,%rax,4),%eax
40058f: 89 42 24 mov %eax,0x24(%rdx)
400592: 48 63 47 28 movslq 0x28(%rdi),%rax
400596: 8b 04 86 mov (%rsi,%rax,4),%eax
400599: 89 42 28 mov %eax,0x28(%rdx)
40059c: 48 63 47 2c movslq 0x2c(%rdi),%rax
4005a0: 8b 04 86 mov (%rsi,%rax,4),%eax
4005a3: 89 42 2c mov %eax,0x2c(%rdx)
4005a6: 48 63 47 30 movslq 0x30(%rdi),%rax
4005aa: 8b 04 86 mov (%rsi,%rax,4),%eax
4005ad: 89 42 30 mov %eax,0x30(%rdx)
4005b0: 48 63 47 34 movslq 0x34(%rdi),%rax
4005b4: 8b 04 86 mov (%rsi,%rax,4),%eax
4005b7: 89 42 34 mov %eax,0x34(%rdx)
4005ba: 48 63 47 38 movslq 0x38(%rdi),%rax
4005be: 8b 04 86 mov (%rsi,%rax,4),%eax
4005c1: 89 42 38 mov %eax,0x38(%rdx)
4005c4: 48 63 47 3c movslq 0x3c(%rdi),%rax
4005c8: 8b 04 86 mov (%rsi,%rax,4),%eax
4005cb: 89 42 3c mov %eax,0x3c(%rdx)
4005ce: c3 retq
4005cf: 90 nop
Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
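To make that distinction concrete, here's a minimal sketch (the bitmap value and function name are made up for illustration):
#include <immintrin.h>

int bitmap_demo()
{
    // Suppose the packed compare produced the 16-bit bitmap 0b0000000000101000,
    // i.e. elements 3 and 5 matched.
    unsigned mask = 0b0000000000101000;
    int howmany = _mm_popcnt_u32(mask);  // popcnt counts set bits: 2 matches
    int first   = _tzcnt_u32(mask);      // tzcnt finds the lowest set bit: index 3
    return first;
}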
AVX2 / AVX512 have vpgatherdd, which does use a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual which has a section on gather performance.) The SIMD compare and bit-scan are very cheap by comparison: each a single uop, fully pipelined.
gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. Sometimes that's possible after inlining, but for the non-inline version you can promise it with int * __restrict output, as sketched below. Or make output a local temporary instead of a function arg. (General rule: storing through a non-__restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
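For example, a minimal sketch of the same loop with just that one qualifier added (per the above, this should be enough for gcc8.1 to use vpgatherdd):
void pleaseVectorize(int* arr, int* someGlobalArray, int* __restrict output)
{
    for (int i = 0; i < 16; ++i)
    {
        // __restrict promises output doesn't alias arr or someGlobalArray,
        // so the compiler can load all 16 indices up front and gather.
        output[i] = someGlobalArray[arr[i]];
    }
}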
gcc and clang never vectorize search loops; only loops where the trip-count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local so it doesn't have to do that as a side-effect of running the function), then uses SIMD packed-compare + bit-scan.
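To illustrate the distinction (a minimal sketch; the function names are just for illustration):
// Trip count is known before the loop starts: gcc/clang can vectorize this.
int count_ones(const int* output)
{
    int total = 0;
    for (int i = 0; i < 16; ++i)
        total += (output[i] == 1);
    return total;
}

// A data-dependent early exit makes this a search loop: gcc/clang won't vectorize it.
int find_one(const int* output)
{
    for (int i = 0; i < 16; ++i)
        if (output[i] == 1)
            return i;
    return -1;
}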
Here's compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512: 512-bit vectors can limit max turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
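If you've profiled and do want 512-bit vectors for a loop like this, you can override that tuning choice on gcc and clang with -mprefer-vector-width=512 (an existing option on both, though defaults and exact behavior vary by compiler version), e.g. g++ -O3 -march=skylake-avx512 -mprefer-vector-width=512.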
If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization. In fact, they even store to the local array, even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code; see the sketch below. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy, especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle. But using a wide vector load for the indices to feed a gather reduces load-port / cache-access pressure significantly if your data was mostly hot in cache.)
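Here's a sketch of that source-level change (the function name is just for illustration): compare inside the gather loop, so a hit in element 0 skips the other 15 loads.
int find_first_1_scalar(const int* arr, const int* someGlobalArray)
{
    for (int i = 0; i < 16; ++i)
    {
        if (someGlobalArray[arr[i]] == 1)  // load one element and test it right away
            return i;                      // early out: skip the remaining loads
    }
    return -1;
}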
A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part; ICC manages the SIMD compare part.)
;; Windows x64 calling convention: rcx, rdx, r8, r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required
vmovdqu32 zmm16, [rcx/arr] ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that
kxnorw k1, k0, k0 ; k1 = -1. k0 false dep is likely not a problem.
; optional: vpxord xmm17, xmm17, xmm17 ; break merge-masking false dep
vpgatherdd zmm17{k1}, [rdx + zmm16 * 4] ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done
vmovdqu32 [r8/output], zmm17
vpcmpd k1, zmm17, zmm31, 0 ; 0->EQ. Outside the loop, do zmm31=set1_epi32(1)
; k1 = compare bitmap
kortestw k1, k1
jz .not_found ; early check for not-found
kmovw edx, k1
; tzcnt doesn't have a false dep on the output on Skylake
; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt eax, edx ; bit-scan for the first (lowest-address) set element
; input=0 produces output=32
; or avoid the branch and let 32 be the not-found return value.
; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret
.not_found:
mov eax, -1
ret
You can do this yourself with intrinsics:
Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.
#include <immintrin.h>

//__declspec(noinline)  // I *hope* this was just to see the stand-alone asm version
// but it means the output array can't optimize away at all
//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
    __m512i vindex = _mm512_load_si512(arr);
    __m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4);  // indexing by 4-byte int
    *output = gather;

    __mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
    // Intrinsics make masks freely convert to integer,
    // even though it costs a `kmov` instruction either way.
    int onepos = _tzcnt_u32(cmp);
    if (onepos >= 16) {
        return -1;
    }
    return onepos;
}
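A hypothetical caller might look like this (the caller's name and shape are made up, just to show the __m512i* output param; after inlining, gathered can stay in a register):
#include <immintrin.h>

int search_table(const int* arr, const int* someGlobalArray)
{
    // arr must be 64-byte aligned for the _mm512_load_si512 inside find_first_1
    __m512i gathered;  // local vector instead of an int[16] output array
    int pos = find_first_1(arr, someGlobalArray, &gathered);
    // ... use `gathered` here if the caller also wants the gathered values ...
    return pos;
}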
All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices when inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.
gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return-value handling. The silly compiler should keep -1 around in a register and use a different register for the 1.
All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16..31 are not accessible with legacy-SSE, so the SSE/AVX transition-penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31.) There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, there's no reason to avoid xmm0..15.
Oh, and in the Windows calling convention, xmm6..15 are call-preserved (not ymm/zmm, just the low 128 bits), so zmm16..31 are a good choice if you run out of xmm0..5 regs.