Setting and clearing the zero flag in x86

Tags:

What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64?

Methods that work without the need for a register with a known value, or without any free registers at all are preferred, but if a better method is available when those or other assumptions are true it is also worth mentioning.

475

asked Feb 03 '19 02:02

BeeOnRope

2 Answers

ZF=0

This is harder. cmp between any two regs known to be not equal. Or cmp reg,imm with any value some reg couldn't possibly have. e.g. cmp reg,1 with any known-zero register.

In general test reg,reg is good with any known-non-0 register value, e.g. a pointer.
test rsp, rsp is probably a good choice, or even test esp, esp to save a byte will work except if your stack is in the unusual location of spanning the 4G boundary.

I don't see a way to create ZF=0 in one instruction without a false dependency on some input reg. xor eax,eax / inc eax or dec will do the trick in 2 uops if you don't mind destroying a register, breaking false dependencies. (not doesn't set FLAGS, and neg will just do 0-0 = 0.)

or eax, -1 doesn't need any pre-condition for the register value. (False dependency, but not a true dependency so you can pick any register even if it might be zero.) It doesn't have to be -1, it's not gaining you anything so if you can make it something useful so much the better.

or eax,-1 FLAG results: ZF=0 PF=1 SF=1 CF=0 OF=0 (AF=undefined).

If you need to do this in a loop, you can obviously set up for it outside the loop, if you can dedicate a register to being non-zero for use with test.

ZF=1

Least destructive: cmp eax,eax - but has a false dependency (I assume) and needs a back-end uop: not a zeroing idiom. RSP doesn't usually change much so cmp esp, esp could be a good choice. (Unless that forces a stack-sync uop).

Most efficient: xor-zeroing (like xor eax,eax using any free register) is definitely the most efficient way on SnB-family (same cost as a 2-byte nop, or 3-byte if it needs a REX because you want to zero one of r8d..r15d): 1 front-end uop, zero back-end uops on SnB-family, and the FLAGS result is ready in the same cycle it issues. (Relevant only in case the front-end was stalled, or some other case where a uop depending on it issues in the same cycle and there aren't any older uops in the RS with ready inputs, otherwise such uops would have priority for whichever execution port.)

Flag results: ZF=1 PF=1 SF=0 CF=0 OF=0 (AF=undefined). (Or use sub eax,eax to get well-defined AF=0. In practice modern CPUs pick AF=0 for xor-zeroing, too, so they can decode both zeroing idioms the same way. Silvermont only recognizes 32-bit operand-size xor as a zeroing idiom, not sub.)

xor-zero is very cheap on all other uarches as well, of course: no input dependencies, and doesn't need any pre-existing register value. (And thus doesn't contribute to P6-family register-read stalls). So it will be at worst tied with anything else you could do on any other uarch (where it does require an execution unit.)

(On early P6-family, before Pentium M, xor-zeroing does not break dependencies; it only triggers the special al=eax state that avoids partial-register stuff. But none of those CPUs are x86-64, all 32-bit only.)

It's pretty common to want a zeroed register for something anyway, e.g. as a sub destination for 0 - x to copy-and-negate, so take advantage of it by putting the xor-zeroing where you need it to also create a useful FLAG condition.

Interesting but probably not useful: test al, 0 is 2 bytes long. But so is cmp esp,esp.

As @prl suggested, cmp same,same with any register will work without disturbing a value. I suspect this is not special-cased as dependency breaking the way sub same,same is on some CPUs, so pick a "cold" register. Again 2 or 3 bytes, 1 uop. It can micro-fuse with a JCC, but that would be dumb (unless the JCC is also a branch target from some other condition?)

Flag results: same as xor-zeroing.

Downsides:

(probably) false dependency
on P6-family can contribute to a register-read stall, so pick a cold register you're already reading in nearby instructions.
needs a back-end execution unit on SnB-family

Just for fun, other as-cheap alternatives include test al, 0. 2 bytes for AL, 3 or 4 bytes for any other 8-bit register. (REX) + opcode + modrm + imm8. The original register value doesn't matter because an imm8 of zero guarantees that reg & 0 = 0.

If you happen to have a 1 or -1 in a register you can destroy, 32-bit mode inc or dec would set ZF in only 1 byte. But in x86-64 that's at least 2 bytes. Nothing comes to mind for a 1-byte instruction in 64-bit mode that's actually efficient and sets FLAGS.

ZF=!CF

sbb same,same can set ZF=!CF (leaving CF unmodified), and setting the reg to 0 (CF=0) or -1 (CF=1). On AMD since Bulldozer (BD-family and Zen-family), this has no dependency on the GP register, only CF. But on other uarches it's not special cased and there is a false dep on the reg. And it's 2 uops on Intel before Broadwell.

ZF=bool(integer register)

To set ZF=integer_reg, obviously the normal test reg,reg is your best bet. (Better than and reg,reg or or reg,reg, unless you're intentionally rewriting the register to avoid P6 register-read stalls.)

Other FLAGS conditions:

How to read and write x86 flags registers directly?
One instruction to clear PF (Parity Flag) -- get odd number of bits in result register (not possible without pre-existing register values for test or cmp).
How can I set or clear overflow flag in x86 assembly? (e.g. for the start of an ADOX chain.)
pushf/pop rax is not terrible, but writing flags with popf is very slow (e.g. 1/20c throughput on SKL). It's microcoded because flags like IF also live in EFLAGS, and there isn't a condition-codes-only version or a special fast-path for user-space. (Or maybe 20c is the fast path.)
lahf (FLAGS->AH) / sahf (AH->FLAGS) can be useful but miss OF.

CF has clc/stc/cmc instructions. (clc is as efficient as xor-zeroing on SnB-family.)

answered Nov 26 '22 08:11

Peter Cordes

Assuming you don’t need to preserve the values of the other flags,

cmp eax, eax

answered Nov 26 '22 06:11

prl

Related questions
                            
                                draw over 4K polylines in android google maps
                            
                                Causes of clr ! JIT_New in PerfView CPU stack
                            
                                How to do something n times per second?
                            
                                Performance of uu-parsinglib compared to "try" in Parsec
                            
                                Would it benefit to pre-compile jade templates on production in express
                            
                                Is there a non-blocking method to check for data faster than select() and poll()?
                            
                                Vector performance in Go and C++
                            
                                Python faster alternative to dictionary? [duplicate]
                            
                                How to speed up or vectorize a for loop?
                            
                                Write double (triple) sum as inner product?
                            
                                How to increase the request per second on amazon EC2 T2.micro instance?
                            
                                Why does malloc rely on mmap starting from a certain threshold?
                            
                                SQL query that can find Typos in Arabic language
                            
                                Faster lookup tables using AVX2
                            
                                Azure data factory copy activity from Storage to SQL: hangs at 70000 rows
                            
                                Push_back faster than insert?
                            
                                Understanding Memory Pools
                            
                                Big-O What are the constants k and n0 in the formal definition of the order of an algorithm?
                            
                                Why do I see 400x outlier timings when calling clock_gettime repeatedly?
                            
                                clflush to invalidate cache line via C function

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Setting and clearing the zero flag in x86

Tags:

performance

x86

assembly

x86-64

micro-optimization