Test whether a register is zero with CMP reg,0 vs OR reg,reg?

Is there any execution speed difference using the following code:

cmp al, 0
je done

and the following:

or al, al
jz done

I know that the JE and JZ instructions are the same, and also that using OR gives a size improvement of one byte. However, I am also concerned with code speed. It seems that logical operators will be faster than a SUB or a CMP, but I just wanted to make sure. This might be a trade-off between size and speed, or a win-win (of course the code will be more opaque).

asked Nov 15 '15 by sadljkfhalskdjfh

2 Answers

Yes, there is a difference in performance.

The best choice for comparing a register with zero is test reg, reg. It sets FLAGS the same way cmp reg,0 would, and is at least as fast[1] as any other way, with smaller code-size.

(Even better is when ZF is already set appropriately by the instruction that set reg so you can just branch, setcc, or cmovcc directly. For example, the bottom of a normal loop often looks like dec ecx / jnz .loop_top. Most x86 integer instructions "set flags according to the result", including ZF=1 if the output was 0.).
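A minimal sketch of that loop pattern, branching directly on the flags set by dec with no separate compare:

```asm
; Countdown loop: dec sets ZF when ecx reaches zero,
; so no cmp/test is needed before the branch.
    mov  ecx, 100
.loop_top:
    ; ... loop body (must not clobber FLAGS after the dec) ...
    dec  ecx
    jnz  .loop_top        ; branches on ZF set by dec
```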

or reg,reg can't macro-fuse with a JCC into a single uop on any existing x86 CPUs, and adds latency for anything that later reads reg because it rewrites the value into the register. cmp's downside is usually just code-size.

Footnote 1: There is a possible exception, but only on obsolete P6-family CPUs (Intel up to Nehalem, replaced by Sandybridge-family in 2011). See below about avoiding register-read stalls by rewriting the same value into a register. Other microarchitecture families don't have such stalls, and there's never any upside to or over test.


The FLAGS results of test reg,reg / and reg,reg / or reg,reg are identical to cmp reg, 0 in all cases (except for AF) because:

  • CF = OF = 0 because test/and always do that, and for cmp because subtracting zero can't overflow or carry.
  • ZF, SF, PF set according to the result (i.e. reg): reg&reg for test, or reg - 0 for cmp.

(AF is undefined after test, but set according to the result for cmp. I'm ignoring it because it's really obscure: the only instructions that read AF are the ASCII-adjust packed-BCD instructions like AAS, and lahf / pushf.)

You can of course check conditions other than reg == 0 (ZF), e.g. test for negative signed integers by looking at SF. But fun fact: jl, the signed less-than condition, is more efficient than js on some CPUs after a cmp. They're equivalent after compare against zero because OF=0 so the l condition (SF!=OF) is equivalent to SF.

Every CPU that can macro-fuse TEST/JL can also macro-fuse TEST/JS, even Core 2. But after CMP byte [mem], 0, always use JL not JS to branch on the sign bit because Core 2 can't macro-fuse that. (At least in 32-bit mode; Core 2 can't macro-fuse at all in 64-bit mode).
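Putting the jl-vs-js advice together, a sketch of branching on the sign bit (labels are illustrative):

```asm
; Branch if a signed value is negative.
; After a compare/test against zero, OF=0, so "l" (SF != OF) == SF.
    test eax, eax
    js   .negative        ; fine here: test/js can macro-fuse

    cmp  byte [mem], 0
    jl   .negative        ; prefer jl over js after cmp:
                          ; Core 2 can't macro-fuse cmp/js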

The signed-compare conditions also let you do stuff like jle or jg, looking at ZF as well as SF!=OF.


test is shorter to encode than cmp with immediate 0, in all cases except the cmp al, imm8 special case which is still two bytes.

Even then, test is preferable for macro-fusion reasons (with jle and similar on Core2), and because having no immediate at all can possibly help uop-cache density by leaving a slot that another instruction can borrow if it needs more space (SnB-family).
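The size difference is easy to see in the machine code; these encodings (NASM, and identical in 64-bit mode for the 32-bit operand-size forms) can be confirmed with any disassembler:

```asm
test eax, eax        ; 85 C0      -> 2 bytes
cmp  eax, 0          ; 83 F8 00   -> 3 bytes (sign-extended imm8)
test al, al          ; 84 C0      -> 2 bytes
cmp  al, 0           ; 3C 00      -> 2 bytes (AL, imm8 special case)
or   eax, eax        ; 09 C0      -> 2 bytes (but see below: don't)
```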


Macro-fusion of test/jcc into a single uop in the decoders

The decoders in Intel and AMD CPUs can internally macro-fuse test and cmp with some conditional branch instructions into a single compare-and-branch operation. This gives you a max throughput of 5 instructions per cycle when macro-fusion happens, vs. 4 without macro-fusion. (For Intel CPUs since Core2.)

Recent Intel CPUs can macro-fuse some instructions (like and and add/sub) as well as test and cmp, but or is not one of them. AMD CPUs can only merge test and cmp with a JCC. See x86_64 - Assembly - loop conditions and out of order, or just refer directly to Agner Fog's microarch docs for the details of which CPU can macro-fuse what. test can macro-fuse in some cases where cmp can't, e.g. with js.

Almost all simple ALU ops (bitwise boolean, add/sub, etc.) run in a single cycle. They all have the same "cost" in tracking them through the out-of-order execution pipeline. Intel and AMD spend the transistors to make fast execution units to add/sub/whatever in a single cycle. Yes, bitwise OR or AND is simpler, and probably uses slightly less power, but still can't run any faster than one clock cycle.


or reg, reg adds another cycle of latency to the dependency chain for following instructions that need to read the register. It's an x |= x in the chain of operations that lead to the value you want.


You might think that extra register write would also need an extra physical register-file (PRF) entry vs. test, but that's probably not the case. (See https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about PRF capacity impact on out-of-order exec).

test has to produce its FLAGS output somewhere. On Intel Sandybridge-family CPUs at least, when an instruction produces a register and a FLAGS result, both of them are stored together in the same PRF entry. (Source: an Intel patent I think. This is from memory but seems like an obviously sane design.)

An instruction like cmp or test that only produces a FLAGS result also needs a PRF entry for its output. Presumably this is slightly worse: the old physical register is still "alive", referenced as the holder of the value of the architectural register written by some older instruction. And now architectural EFLAGS (or more specifically, both the separately-renamed CF and SPAZO flag groups) point to this new physical register in the RAT (register allocation table) updated by the renamer. Of course, the next FLAGS-writing instruction will overwrite that, allowing that PR to be freed once all its readers have read it and executed. This is not something I think about when optimizing, and I don't think it tends to matter in practice.


P6-family register-read stalls: possible upside to or reg,reg

P6-family CPUs (PPro / PII to Nehalem) have a limited number of register-read ports for the issue/rename stage to read "cold" values (not forwarded from an in-flight instruction) from the permanent register file, but recently-written values are available directly from the ROB. Rewriting a register unnecessarily can make it live in the forwarding network again to help avoid register-read stalls. (See Agner Fog's microarch pdf).

Re-writing a register with the same value on purpose to keep it "hot" can actually be an optimization for some cases of surrounding code, on P6. Early P6 family CPUs couldn't do macro-fusion at all, so you aren't even missing out on that by using and reg,reg instead of test. But Core 2 (in 32-bit mode) and Nehalem (in any mode) can macro-fuse test/jcc so you're missing out on that.

(and is equivalent to or for this purpose on P6-family, but less bad if your code ever runs on a Sandybridge-family CPU: it can macro-fuse and/jcc but not or/jcc. The extra cycle of latency in the dep-chain for the register is still a disadvantage on P6, especially if the critical path involving it is the main bottleneck.)

P6 family is very much obsolete these days (Sandybridge replaced it in 2011), and CPUs before Core 2 (Core, Pentium M, PIII, PII, PPro) are very obsolete and getting into retrocomputing territory, especially for anything where performance matters. You can ignore P6-family when optimizing unless you have a specific target machine in mind (e.g. if you have a crusty old Nehalem Xeon machine) or you're tuning a compiler's -mtune=nehalem settings for the few users still left.

If you're tuning something to be fast on Core 2 / Nehalem, use test unless profiling shows that register-read stalls are a big problem in a specific case, and using and actually fixes it.

On earlier P6-family, and reg,reg might be ok as your default code-gen choice when the value isn't part of a problematic loop-carried dep chain, but is read later. Or if it is, but there's also a specific register-read stall that you can fix with and reg,reg.

If you only want to test the low 8 bits of a full register, test al,al avoids writing a partial register, which on P6-family is renamed separately from the full EAX/RAX. or al,al is much worse if you later read EAX or AX: partial-register stall on P6-family. (Why doesn't GCC use partial registers?)
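A sketch of the partial-register trap on P6-family (the add stands in for any later read of the full register):

```asm
; P6-family: AL is renamed separately from EAX.
    or   al, al            ; writes AL (a partial register)
    jz   .done
    add  edx, eax          ; P6: stalls while merging AL back into EAX

    test al, al            ; reads AL, writes only FLAGS
    jz   .done
    add  edx, eax          ; no partial-register stall
```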


History of the unfortunate or reg,reg idiom

The or reg,reg idiom may have come from 8080 ORA A, as pointed out in a comment.

8080's instruction set doesn't have a test instruction, so your choices for setting flags according to a value included ORA A and ANA A. (Notice that the A register destination is baked into the mnemonic for both those instructions, and there aren't instructions to OR into different registers: it's a 1-address machine except for mov, while 8086 is a 2-address machine for most instructions.)

8080 ORA A was the usual go-to way to do it, so presumably that habit carried over into 8086 assembly programming as people ported their asm sources. (Or used automatic tools; 8086 was intentionally designed for easy / automatic asm-source porting from 8080 code.)

This bad idiom continues to be blindly used by beginners, presumably taught by people who learned it back in the day and passed it on without thinking about the obvious critical path latency downside for out-of-order execution. (Or the other more subtle problems like no macro-fusion.)


Delphi's compiler reportedly uses or eax,eax, which was maybe a reasonable choice at the time (before Core 2), assuming that register-read stalls were more important than lengthening the dep chain for whatever reads it next. IDK if that's true or they were just using the ancient idiom without thinking about it.

Unfortunately, compiler-writers at the time didn't know the future, because and eax,eax performs exactly equivalently to or eax,eax on Intel P6-family, but is less bad on other uarches because and can macro-fuse on Sandybridge-family. (See the P6 section above).


Value in memory: maybe use cmp or load it into a reg.

To test a value in memory, you can cmp dword [mem], 0, but Intel CPUs can't macro-fuse flag-setting instructions that have both an immediate and a memory operand. If you're going to use the value after the compare in one side of the branch, you should mov eax, [mem] / test eax,eax or something. If not, either way is 2 front-end uops, but it's a tradeoff between code-size and back-end uop count.

Although note that some addressing modes won't micro-fuse either on SnB-family: RIP-relative + immediate won't micro-fuse in the decoders, and an indexed addressing mode will un-laminate after the uop cache. Either way you end up with 3 fused-domain uops for cmp dword [rsi + rcx*4], 0 / jne or cmp dword [rel some_static_location], 0 / jne.

On i7-6700k Skylake (tested with perf events uops_issued.any and uops_executed.thread):

  • mov reg, [mem] (or movzx) + test reg,reg / jnz 2 uops in both fused and unfused domains, regardless of addressing mode, or movzx instead of mov. Nothing to micro-fuse; does macro-fuse.
  • cmp byte [rip+static_var], 0 + jne. 3 fused, 3 unfused. (front and back ends). The RIP-relative + immediate combination prevents micro-fusion. It also doesn't macro-fuse. Smaller code-size but less efficient.
  • cmp byte [rsi + rdi], 0 (indexed addr mode) / jne 3 fused, 3 unfused. Micro-fuses in the decoders, but un-laminates at issue/rename. Doesn't macro-fuse.
  • cmp byte [rdi + 16], 0 + jne 2 fused, 3 unfused uops. Micro-fusion of cmp load+ALU did happen because of the simple addressing mode, but the immediate prevents macro-fusion. About as good as load + test + jnz: smaller code-size but 1 extra back-end uop.

If you have a 0 in a register (or a 1 if you want to compare a bool), you can cmp [mem], reg / jne for even fewer uops, as low as 1 fused-domain, 2 unfused. But RIP-relative addressing modes still don't macro-fuse.
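A sketch of that cmp-with-register trick (register choice and offset are illustrative):

```asm
; Keep a zero in a register, off the critical path:
    xor  ecx, ecx
    ; ... other work ...
; Simple addressing mode + register operand: the load micro-fuses
; and the compare macro-fuses with the branch -> 1 fused-domain uop.
    cmp  [rdi + 16], ecx
    jne  .nonzero
```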

Compilers tend to use load + test/jcc even when the value isn't used later.

You could also test a value in memory with test dword [mem], -1, but don't. Since test r/m16/32/64, sign-extended-imm8 isn't available, it's worse code-size than cmp for anything larger than bytes. (I think the design idea was that if you only want to test the low bit of a register, just test cl, 1 instead of test ecx, 1, and use cases like test ecx, 0xfffffff0 are rare enough that it wasn't worth spending an opcode. Especially since that decision was made for 8086 with 16-bit code, where it was only the difference between an imm8 and imm16, not imm32.)

(I wrote -1 rather than 0xFFFFFFFF so it would be the same with byte or qword. ~0 would be another way to write it.)

Related:

  • What is instruction fusion in contemporary x86 processors? (micro- and macro-fusion). TODO: move the test results there (and update my answer there to fix some things that don't match my current results.)
  • x86_64 - Assembly - loop conditions and out of order (which instructions can macro-fusion on Sandybridge-family)
answered Oct 14 '22 by Peter Cordes


It depends on the exact code sequence, which specific CPU it is, and other factors.

The main problem with or al, al is that it "modifies" EAX, which means that a subsequent instruction that uses EAX in some way may stall until this instruction completes. Note that the conditional branch (jz) also depends on the instruction, but CPU manufacturers do a lot of work (branch prediction and speculative execution) to mitigate that. Also note that in theory it would be possible for a CPU manufacturer to design a CPU that recognises EAX isn't changed in this specific case, but there are hundreds of these special cases and the benefit of recognising most of them is too small.

The main problem with cmp al,0 is that it's slightly larger, which might mean slower instruction fetch/more cache pressure, and (if it is a loop) might mean that the code no longer fits in some CPU's "loop buffer".

As Jester pointed out in comments; test al,al avoids both problems - it's smaller than cmp al,0 and doesn't modify EAX.

Of course (depending on the specific sequence) the value in AL must've come from somewhere, and if it came from an instruction that set flags appropriately it might be possible to modify the code to avoid using another instruction to set flags again later.
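For example (a sketch, assuming AL was just produced by a flag-setting instruction), you can branch on those flags directly:

```asm
    sub  al, bl            ; already sets ZF if the result is zero
    jz   .done             ; no separate test al,al / cmp al,0 needed
```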

answered Oct 14 '22 by Brendan