 

Can AVX2-compiled program still use 32 registers of an AVX-512 capable CPU?

Assuming AVX2-targeted compilation with C++ intrinsics: if I write an n-body algorithm that uses 17 registers per body-body computation, can the 17th register be mapped onto an AVX-512 register, either indirectly (by the register-rename hardware) or directly (by the Visual Studio or GCC compiler), to cut the memory (spill/reload) dependency? For example, the Skylake architecture has 1 or 2 AVX-512 FMA units. Does this number change the total registers available too? (Specifically, on a Xeon Silver 4114 CPU.)

If this works, how does it work? Would the 1st hardware thread use the first half of each ZMM vector and the 2nd hardware thread the second half, when all instructions are AVX2 or less?


Edit: what if there is online compilation on the target machine (with OpenCL, for example)? Can drivers do the above register usage for me?

asked Dec 03 '22 by huseyin tugrul buyukisik

1 Answer

TL:DR: compile with -march=skylake-avx512 to let the compiler use EVEX prefixes to access ymm16-31 so it can (hopefully) make better asm for code that has 17 __m256 values "live" at once.

-march=skylake-avx512 includes -mavx512vl


For example, skylake architecture has 1 or 2 AVX-512 fma units. Does this number change total registers available too?

No, the physical register file is the same size in all Skylake CPUs, regardless of how many FMA execution units are present. These things are totally orthogonal.

The number of architectural YMM registers is 16 for 64-bit AVX2, and 32 for 64-bit AVX512VL. In 32-bit code, there are always only 8 vector registers available, even with AVX512. (So 32-bit is very obsolete for most high-performance computing.)

The longer EVEX encoding is required to access YMM16-31 (AVX512VL, see footnote 1), but instructions with all operands in the low 16 registers can use the shorter VEX-prefix AVX/AVX2 form of the instruction. (There's no penalty for mixing VEX and EVEX encodings, so VEX is preferable for code-size. But if you avoid y/zmm0-y/zmm15, you don't need VZEROUPPER; legacy-SSE instructions can't touch xmm16-31, so there's no possible problem.)

Again, none of this has anything to do with the amount of FMA execution units present.

Footnote 1: AVX512F only includes the ZMM versions of most instructions; you need AVX512VL for the EVEX encoding of most YMM instructions. The only CPUs with AVX512F but not AVX512VL are Xeon Phi, KNL / KNM, now discontinued; all mainstream CPUs support xmm/ymm versions of all the AVX512 instructions they support.

if I write an nbody algorithm using 17 registers per body-body computation, can 17th register be indirectly(register rename hardware) mapped

No, this is not how CPUs and machine code work. In machine code, there's only a 4-bit (without using AVX512-only encodings) or 5-bit (with AVX512 encodings) field to specify a register operand for an instruction.

If your code needs 17 vector values to be "live" at once, the compiler will have to emit instructions to spill/reload one of them when targeting x86-64 AVX2, which architecturally only has 16 YMM registers. i.e. it has 16 different names which the CPU can rename onto its larger internal register file.

If register renaming solved the whole problem, x86-64 wouldn't have bothered increasing the number of architectural registers from 8 integer / 8 xmm to 16 integer / 16 xmm.

This is why AVX512 spent 3 extra bits (1 each for dst, src1, and src2) to allow access to 32 architectural vector registers beyond what VEX prefixes can encode. (Only in 64-bit mode; 32-bit mode still only has 8. In 32-bit mode, VEX and EVEX prefixes are invalid encodings of existing instructions, and flipping those extra register-number bits would make them decode as valid encodings of those old instructions instead of as prefixes.)


Register renaming allows reuse of the same architectural register for a different value without any false dependency, i.e. it avoids WAR and WAW hazards; it's part of the "magic" that makes out-of-order execution work. It helps keep more values in flight for ILP and out-of-order execution, but it doesn't let you have more values live in architectural registers at any one point in program order.

For example, the following loop only needs 3 architectural registers, and each iteration is independent (no loop-carried dependency, other than the pointer-increment).

.loop:
    vaddps   ymm0, ymm1, [rsi]  ; ymm0 = ymm1, [src]
    vmulps   ymm0, ymm0, ymm2   ; ymm0 *= ymm2
    vmovaps  [rsi+rdx], ymm0    ; dst = src + (dst_start - src_start).  Stays micro-fused on Haswell+

    add      rsi, 32
    cmp      rsi, rcx   ; }while(rsi < end_src)
    jb   .loop

But with an 8-cycle latency chain from the first write of ymm0 to the last read within an iteration (Skylake addps / mulps are 4 cycles each), it would bottleneck on that, on a CPU without register renaming. The next iteration couldn't write to ymm0 until the vmovaps in this iteration had read the value.

But on an out-of-order CPU, multiple iterations are in flight at once, with each write to ymm0 renamed to a different physical register. Ignoring the front-end bottleneck (pretend we unrolled), the CPU can keep enough iterations in flight to saturate the FMA units with 2 addps/mulps uops per clock, using about 8 physical registers. (Or more, because physical registers can't actually be freed until retirement, not as soon as the last uop has read the value.)

The limited physical register file size can be the limit on the out-of-order window size, instead of the ROB or scheduler size.

(We thought for a while that Skylake-AVX512 uses 2 PRF entries for a ZMM register, based on this result, but later more detailed experiments revealed that AVX512 mode powers up a wider PRF, or upper lanes to complement the existing PRF, so SKX in AVX512 mode still has the same number of 512-bit physical registers as 256-bit physical registers. See discussion between @BeeOnRope and @Mysticial. I think there was a better write-up of an experiment + results somewhere but I can't find it ATM.)


Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (answer: it doesn't; the OP was confused about register-reuse. My answer explains in lots of detail, with some interesting performance experiments with multiple vector accumulators.)

answered Dec 14 '22 by Peter Cordes