I've been wondering for the longest time: what is the best way to pass SIMD register types in C++?
In my particular case, I have a few layers of abstraction, which in turn call the desired intrinsics. Immintrin functions accept by value (copy), so my guess is it should be a copy. But I'd like to be sure (and satisfy my curiosity).
That is,
__m128 func(__m128 a, __m128 b) {
    return _mm_something(a, b);
}
// vs.
__m128 func(const __m128& a, const __m128& b) {
    return _mm_something(a, b);
}
You normally want functions like this to inline anyway, in which case it's irrelevant for code-gen; pass by value is simpler syntax so I'd recommend it.
But in the rare cases where you're calling in a way the compiler can't or won't inline (see footnote 1), pass by value is normally still the right choice. Calling conventions like x86-64 System V and Windows vectorcall pass vector args in vector registers (XMM0..7 or YMM0..7 for System V; XMM0..5 or YMM0..5 for vectorcall on x64).
Windows x64 without vectorcall will transform pass-by-value at the language level into pass-by-reference in the asm. (Prefer vectorcall if you have non-inline functions with SIMD args and you're targeting Windows.)
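As a rough sketch of the vectorcall suggestion above: the macro names VEC_CALL and NO_INLINE and the function scaled_sum are made up for illustration, and the sketch assumes that only MSVC-compatible compilers (those defining _MSC_VER) need the explicit convention.

```cpp
#include <immintrin.h>

// VEC_CALL and NO_INLINE are hypothetical macro names, not from the answer.
// Assumption: compilers defining _MSC_VER (MSVC, clang-cl) accept __vectorcall;
// on other targets such as x86-64 System V, vector args already use XMM regs.
#if defined(_MSC_VER)
#  define VEC_CALL __vectorcall
#  define NO_INLINE __declspec(noinline)
#else
#  define VEC_CALL
#  define NO_INLINE __attribute__((noinline))
#endif

// Pass by value. The function is kept out of line on purpose so the
// calling convention actually matters for this example.
NO_INLINE __m128 VEC_CALL scaled_sum(__m128 a, __m128 b) {
    return _mm_mul_ps(_mm_add_ps(a, b), _mm_set1_ps(0.5f));
}
```

The call site looks the same either way, e.g. scaled_sum(x, y); only the asm-level argument passing changes.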
If the vector is just loaded from memory in the caller, consider passing a float * arg instead of __m128 and having the callee do the load.
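A minimal sketch of that alternative, where the callee takes pointers and does the loads itself. The name add4_from_memory is hypothetical, and unaligned loads are used only so the example makes no alignment assumption about the caller's data.

```cpp
#include <immintrin.h>

// add4_from_memory is a hypothetical name for illustration.
// The caller passes plain pointers; the loads happen in the callee, so no
// __m128 has to cross the call boundary as an argument.
__m128 add4_from_memory(const float* a, const float* b) {
    __m128 va = _mm_loadu_ps(a);  // loads a[0..3], no alignment required
    __m128 vb = _mm_loadu_ps(b);  // loads b[0..3]
    return _mm_add_ps(va, vb);
}
```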
"Immintrin functions accept by value (copy)"
Note that they're not real functions that ever actually get called in the asm; each one usually compiles to a single asm instruction, although the compiler can optimize them further. e.g. a load intrinsic can fold into a memory operand for an instruction like addps xmm0, [rdi].
Footnote 1: e.g. via function pointer, or to another file in a build without LTO (link-time optimization). Or a large function, although normally vectorization is something you do locally, hidden from the rest of your program, so you can adapt it to different instruction sets without changing the types your program uses all over the place. But you could have one function that's big enough for a compiler to decide not to inline, which takes a vector since the call sites already have a value in a vector that didn't just come from memory.
If a level of abstraction stops the compiler from actually inlining lots of small functions, that's a disaster for SIMD loops; the x86-64 System V calling convention unfortunately has no call-preserved XMM registers, let alone YMM/ZMM, so the compiler would have to spill/reload __m128 locals to the stack around every non-inline call. Windows x64 has too many call-preserved XMM regs (XMM6..15), but no call-preserved YMM/ZMM.
That's why dispatch by CPU features (SSE2 vs. SSE4.1 vs. AVX) needs to be for whole loops, not inside an inner loop. Plus of course the call/ret overhead.
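One way whole-loop dispatch might look, as a sketch only: all names here (add_arrays, add_arrays_sse, add_arrays_avx, pick_add_arrays) are hypothetical, and it assumes GCC or Clang, since it relies on the target attribute and __builtin_cpu_supports. Each version contains the entire loop, and the CPU check runs once rather than per iteration.

```cpp
#include <immintrin.h>
#include <cstddef>

// Baseline version: the whole loop lives in one function (128-bit vectors).
static void add_arrays_sse(float* dst, const float* a, const float* b,
                           std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) dst[i] = a[i] + b[i];  // scalar tail
}

// AVX version: same loop with 256-bit vectors. The target attribute
// (GCC/Clang-specific) enables AVX for this one function only.
__attribute__((target("avx")))
static void add_arrays_avx(float* dst, const float* a, const float* b,
                           std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_loadu_ps(a + i),
                                                _mm256_loadu_ps(b + i)));
    for (; i < n; ++i) dst[i] = a[i] + b[i];  // scalar tail
}

using AddFn = void (*)(float*, const float*, const float*, std::size_t);

// Pick an implementation based on what the running CPU supports.
static AddFn pick_add_arrays() {
    return __builtin_cpu_supports("avx") ? add_arrays_avx : add_arrays_sse;
}

// Public entry point: the feature check happens once, on first use, and
// each call pays one indirect call for the whole loop.
void add_arrays(float* dst, const float* a, const float* b, std::size_t n) {
    static const AddFn impl = pick_add_arrays();
    impl(dst, a, b, n);
}
```

Inside each version the intrinsics can inline normally, so vector values stay in registers across iterations.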