The problem I see with relying on an autovectorizer to convert user-written loop code to SIMD instructions on every compilation, as part of the usual optimizations, is that if you change compilers, you can't be certain the new one will auto-vectorize your code equally well.
Therefore, since I only want to target a single processor, I'd like the compiler to generate high-level C code for a specific function, using the x86 intrinsic wrapper functions that work portably across compiler vendors.
Is there a decompiler, or maybe even a compiler option for GCC, that gives me this code?
Not that I know of, but Intel's intrinsics guide is searchable by asm mnemonic: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Filtering out AVX-512 often makes it easier to wade through (because there are a zillion `_mask` / `_maskz` intrinsics for all 3 vector sizes with AVX-512).
The asm manual entries also list mnemonics for each instruction. https://www.felixcloutier.com/x86/index.html
`-fverbose-asm` can sometimes help you follow variables through the asm, but after auto-vectorization everything usually has names like `tmp1234`. Still, if you're having trouble seeing which pointer is being loaded/stored where, it can help.
You can also get compilers to spit out their internal representations, like LLVM-IR or GIMPLE or RTL, but you can't just look those up in x86 manuals. I already know x86 asm, so I can usually pretty easily see what compilers are doing and translate that to intrinsics by hand. I've actually done this when clang spots something clever that gcc missed, even when the source was already using intrinsics. Or to pure C for scalar code that doesn't auto-vectorize, to hand-hold gcc into doing it clang's way or vice versa.
Compile with `-fno-unroll-loops` if you're using clang, so it vectorizes without unrolling and the asm is less complex. (GCC doesn't unroll by default in the first place.)
But note that optimal auto-vectorization choices depend on which target uarch you're tuning for: clang or gcc with `-O3 -march=znver1` (Zen 1) will make different code than with `-march=skylake`. Often that's just a matter of 128-bit vs. 256-bit vectors rather than a genuinely different strategy, unless a different available instruction set enables something new. e.g. SSE4.1 adds a packed 32-bit integer multiply (non-widening, unlike the 32x32 => 64-bit one) and fills in a lot of the missing element-size and signedness combinations.
It's not necessarily ideal to freeze the vectorization strategy by doing it manually, if you're trying to be future-proof against future CPU microarchitectures and extensions as well as future compilers.