For 64-bit registers, there is the CMOVcc A, B instruction, which writes B to A only if condition cc is satisfied:
; Do rax <- rdx iff rcx == 0
test rcx, rcx
cmove rax, rdx
However, I wasn't able to find anything equivalent for AVX. I still want to move depending on the value of RFLAGS, just with larger operands:
; Do ymm1 <- ymm2 iff rcx == 0
test rcx, rcx
cmove ymm1, ymm2 ; (invalid)
Is there an AVX equivalent for cmov? If not, how can I achieve this operation in a branchless way?
Given this branchy code (which will be efficient if the condition predicts well):
cmp rcx, rdx
jne .nocopy
vmovdqa ymm1, ymm2 ;; copy if RCX==RDX
.nocopy:
We can do it branchlessly by creating a 0 / -1 vector based on the compare condition, and blending on it. Some optimizations vs. the other answer: vmovd/q xmm, reg can only run on a single execution port on Intel (port 5), the same one needed by vector shuffles like vpbroadcastq ymm, xmm. As well as saving 1 total instruction, this makes some of them cheaper (less competition for the same execution port, e.g. scalar xor isn't SIMD at all) and moves some off the critical path (xor-zeroing). And in a loop, you can prepare a zeroed vector outside the loop.
;; inputs: RCX, RDX. YMM1, YMM2
;; output: YMM0
xor rcx, rdx ; 0 or non-0.
vmovq xmm0, rcx
vpxor xmm3, xmm3, xmm3 ; can be done any time, e.g. outside a loop
vpcmpeqq xmm0, xmm0, xmm3 ; 0 if RCX!=RDX, -1 if RCX==RDX
vpbroadcastq ymm0, xmm0
vpblendvb ymm0, ymm1, ymm2, ymm0 ; ymm0 = (rcx==rdx) ? ymm2 : ymm1
Destroying the old RCX means you might need a mov, but this is still worth it.
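If RCX is still live afterwards, a minimal sketch of the copy-first variant (assuming RAX is free as scratch):

mov   rax, rcx      ; preserve RCX
xor   rax, rdx      ; rax = 0 iff rcx == rdx
vmovq xmm0, rax     ; then compare/broadcast/blend as above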
A condition like rcx >= rdx (unsigned) could be done with cmp rcx, rdx / sbb rax, rax to materialize a 0 / -1 integer (which you can broadcast without needing vpcmpeqq). The mask is -1 for the rcx < rdx case, so select on rcx >= rdx by swapping the blend operands, as in the sketch below.
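A sketch of that trick (here selecting ymm2 when rcx >= rdx; RAX as scratch is an illustrative choice):

cmp  rcx, rdx                     ; CF = (rcx < rdx)
sbb  rax, rax                     ; rax = -CF: -1 if rcx < rdx, else 0
vmovq xmm0, rax
vpbroadcastq ymm0, xmm0
vpblendvb ymm0, ymm2, ymm1, ymm0  ; ymm0 = (rcx >= rdx) ? ymm2 : ymm1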
A signed-greater-than condition is more of a pain; you might end up wanting 2x vmovq for vpcmpgtq, instead of cmp / setg / vmovd / vpbroadcastb (plus a neg, since setg produces 0 / 1 but vpblendvb tests the high bit of each byte, so you need 0 / -1). Especially if you don't have a convenient register to setg into to avoid a possible false dependency. setg al / read EAX isn't a problem for partial-register stalls: CPUs new enough to have AVX2 don't rename AL separately from the rest of RAX. (Only Intel ever did that, and Haswell and later don't.) So anyway, you could just setcc into the low byte of one of your cmp inputs, as in the sketch below.
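For example, a sketch of the setcc route for a signed rcx > rdx, clobbering CL (the low byte of a cmp input) so there's no false dependency; the neg converts setg's 0 / 1 into the 0 / -1 byte mask:

cmp   rcx, rdx
setg  cl                         ; cl = 1 if rcx > rdx (signed)
movzx ecx, cl
neg   ecx                        ; 0 / -1
vmovd xmm0, ecx
vpbroadcastb ymm0, xmm0          ; broadcast the mask byte to all 32 bytes
vpblendvb ymm0, ymm1, ymm2, ymm0 ; ymm0 = (rcx > rdx) ? ymm2 : ymm1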
Note that vblendvps and vblendvpd only care about the high bit of each dword or qword element. If you have two correctly sign-extended integers, and subtracting them won't overflow, c - d will be directly usable as your blend control; just broadcast that. FP blends between integer SIMD instructions like vpaddd have an extra 1 cycle of bypass latency on input and output on Intel CPUs with AVX2 (and maybe similar on AMD), but the instruction you save will also have latency.
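For instance, a sketch under those assumptions (c in RCX, d in RDX, both correctly sign-extended, and c - d can't overflow):

sub  rcx, rdx                     ; bit 63 = (c < d), valid only if c-d can't overflow
vmovq xmm0, rcx
vpbroadcastq ymm0, xmm0
vblendvpd ymm0, ymm1, ymm2, ymm0  ; ymm0 = (c < d) ? ymm2 : ymm1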
With unsigned 32-bit numbers, you're likely to have them already zero-extended to 64-bit in integer regs. In that case, sub rcx, rdx could set the MSB of RCX identically to how cmp ecx, edx would set CF. (And remember that the FLAGS condition for jb / cmovb is CF == 1.)
;; unsigned 32-bit compare, with inputs already zero-extended
sub rcx, rdx ; sets MSB = (ecx < edx)
vmovq xmm0, rcx
vpbroadcastq ymm0, xmm0
vblendvpd ymm0, ymm1, ymm2, ymm0 ; ymm0 = ecx<edx ? ymm2 : ymm1
But if your inputs are already 64-bit, and you don't know that their range is limited, you'd need a 65-bit result to fully capture a 64-bit subtraction result. That's why the condition for jl is SF != OF, not just a-b < 0: a-b is done with truncating math. (For example, with a = -2^63 and b = 1, a-b wraps to 2^63 - 1, so the sign bit alone would claim a >= b even though a < b; OF records exactly that wraparound.) And the condition for jb is CF == 1 (instead of the MSB).
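If you do need a full-range signed 64-bit condition, the 2x vmovq / vpcmpgtq route mentioned above sidesteps the truncation problem by doing the compare in the SIMD domain. A sketch:

vmovq xmm0, rcx
vmovq xmm3, rdx
vpbroadcastq ymm0, xmm0
vpbroadcastq ymm3, xmm3
vpcmpgtq  ymm0, ymm0, ymm3       ; each qword = -1 iff rcx > rdx (signed)
vpblendvb ymm0, ymm1, ymm2, ymm0 ; ymm0 = (rcx > rdx) ? ymm2 : ymm1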
While there is no vectorized version of cmov, one can achieve equivalent functionality using a bit mask and blending. Assume we have two 256-bit vectors value1 and value2, which reside in the corresponding vector registers ymm1 and ymm2:
align 32
value1: dq 1.0, 2.0, 3.0, 4.0
value2: dq 5.0, 6.0, 7.0, 8.0
; Operands for our conditional move
vmovdqa ymm1, [rel value1]
vmovdqa ymm2, [rel value2]
We want to compare two registers rcx and rdx:
; Values to compare
mov rcx, 1
mov rdx, 2
If they are equal, we want to copy ymm2 into ymm1 (and thus select value2), else we want to keep ymm1 and thus value1. Equivalent (invalid) notation using cmov:
cmp rcx, rdx
cmove ymm1, ymm2 ; (invalid)
First, we load rcx and rdx into vector registers and broadcast them, so they are copied to all 64-bit chunks of the respective register (. depicts a concatenation). We broadcast into ymm3 and ymm4 as scratch, so that value1 and value2 stay intact in ymm1 and ymm2:
vmovq xmm0, rcx         ; xmm0 <- 0 . rcx
vpbroadcastq ymm3, xmm0 ; ymm3 <- rcx . rcx . rcx . rcx
vmovq xmm0, rdx         ; xmm0 <- 0 . rdx
vpbroadcastq ymm4, xmm0 ; ymm4 <- rdx . rdx . rdx . rdx
Then, we generate a mask using vpcmpeqq:
; If rcx == rdx: ymm0 <- ffffffffffffffff.ffffffffffffffff.ffffffffffffffff.ffffffffffffffff
; If rcx != rdx: ymm0 <- 0000000000000000.0000000000000000.0000000000000000.0000000000000000
vpcmpeqq ymm0, ymm3, ymm4
Finally, we blend ymm2 into ymm1, using the mask in ymm0:
; If rcx == rdx: ymm1 <- ymm2
; If rcx != rdx: ymm1 <- ymm1
vpblendvb ymm1, ymm1, ymm2, ymm0
Thanks to @fuz, who outlined this approach in the comments!
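For reference, here is the whole sequence as one self-contained fragment (NASM syntax, requires AVX2; the select label and the ymm3/ymm4 scratch registers are just illustrative choices):

section .rodata
align 32
value1: dq 1.0, 2.0, 3.0, 4.0
value2: dq 5.0, 6.0, 7.0, 8.0

section .text
; in: rcx, rdx -- out: ymm1 = (rcx == rdx) ? value2 : value1
select:
    vmovdqa ymm1, [rel value1]
    vmovdqa ymm2, [rel value2]
    vmovq   xmm0, rcx
    vpbroadcastq ymm3, xmm0
    vmovq   xmm0, rdx
    vpbroadcastq ymm4, xmm0
    vpcmpeqq  ymm0, ymm3, ymm4         ; all-ones mask iff rcx == rdx
    vpblendvb ymm1, ymm1, ymm2, ymm0   ; keep value1 or take value2
    ret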