Are there any existing instructions that can store the lower or upper half of a 256-bit AVX/AVX2 (YMM) register to a memory address, just like the SSE instructions movlps/movhps do?
Or is there any other way to implement this?
Any help would be appreciated, thanks!
Store the low128 with vmovdqu [rdi], xmm0. Store the high128 with VEXTRACTI128 xmm1/m128, ymm2, 1. Probably you can get a compiler to generate a store to memory by assigning the result of an extract intrinsic to a memory reference.
vextracti128 / vextractf128 with a memory destination takes 2 uops, even in the fused domain (Haswell), so IDK what the point of having it encodable with an immediate operand of 0 is. (That is, until AVX512, when an immediate index instead of a movh-style instruction becomes relevant, since they didn't know they were going to replace VEX with EVEX for AVX512.) There's no penalty for mixing AVX2 instructions on xmm regs with AVX2 instructions on ymm regs, so you can just use a 128b store of the xmm version to get the low 128, just like you can get the low32 of a 64b GP reg by referencing eax instead of rax.
It's probably annoying to cast stuff when using intrinsics, so with luck a compiler will compile _mm256_extracti128_si256(vec, 0) to a vmovdqu of the corresponding xmm reg. But if your compiler doesn't, your code will be faster if you get it to generate vmovdqu. (vmovdqu is as fast as vmovdqa if the address is aligned, just like non-mov AVX memory access.)
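For the intrinsics route, here is a minimal sketch in C of storing each half separately. The helper name store_halves and its pointer parameters are made up for illustration, and the target attribute is a GCC/Clang extension (building with -mavx2 instead should work as well). The low half goes through _mm256_castsi256_si128, which only reinterprets the register and generates no instruction, so a compiler would be expected to emit a plain 128-bit store for it. The high half uses _mm256_extracti128_si256 with index 1. What a given compiler actually emits is worth checking in the disassembly.

```c
#include <immintrin.h>
#include <stdint.h>

/* store_halves is a hypothetical helper name, not something from the answer.
   The target attribute (GCC/Clang only) enables AVX2 for this one function. */
__attribute__((target("avx2")))
void store_halves(const uint8_t *src, uint8_t *lo_dst, uint8_t *hi_dst)
{
    /* Unaligned 256-bit load, so src needs no particular alignment. */
    __m256i v = _mm256_loadu_si256((const __m256i *)src);

    /* Low 128 bits: the cast just views the ymm register as its xmm half,
       then an unaligned 128-bit store writes it out. */
    _mm_storeu_si128((__m128i *)lo_dst, _mm256_castsi256_si128(v));

    /* High 128 bits: extract lane 1 and store it. */
    _mm_storeu_si128((__m128i *)hi_dst, _mm256_extracti128_si256(v, 1));
}
```

Taking pointers instead of passing a __m256i by value keeps the 256-bit type out of the function signature, which avoids ABI warnings when the caller is not built with AVX enabled.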