Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"?

Which is generally faster to test the byte in AL for zero / non-zero?

  • TEST EAX, EAX
  • TEST AL, AL

Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about.

So AL=EAX and there are no partial-register penalties for reading EAX.

Intuitively just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access of a >32-bit register.

Any info/details appreciated, thanks!

like image 376
John Doe Avatar asked Aug 01 '19 19:08

John Doe


1 Answers

Code-size is equal, and so is performance on all x86 CPUs AFAIK.

Intel CPUs (with partial-register renaming) definitely don't have a penalty for reading AL after writing EAX. Other CPUs also have no penalty for reading low-byte registers.

Reading AH would have a penalty on Intel CPUs, like some extra latency. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)

In general 32-bit operand-size and 8-bit operand size (with low-8 not high-8) are equal speed except for the false-dependencies or later partial-register reading penalties of writing an 8-bit register. Since TEST only reads registers, this can't be a problem. Even add al, bl is fine: the instruction already had an input dependency on both registers, and on Sandybridge-family a RMW to the low byte of a register doesn't rename it separately. (Haswell and later don't rename low-byte registers separately anyway).

Pick whichever operand-size you like. 8-bit and 32-bit are basically equal. The choice is just a matter of human readability. If you're going to work with the value as a 32-bit integer later, then go 32-bit. If it's logically still an 8-bit value and you were only using movzx as the x86 equivalent of ARM ldrb or MIPS lbu, then using 8-bit makes sense.

There are code-size advantages to instructions like cmp al, imm which can use the no-modrm short-form encoding. cmp al, 0 is still worse than test al,al on some old CPUs (Core 2), where cmp/jcc macro-fusion is less flexible than test/jcc macro-fusion. (Test whether a register is zero with CMP reg,0 vs OR reg,reg?)


There is one difference between these instructions: test al,al sets SF according to the high bit of AL (which can be non-zero). test eax,eax will always clear SF. If you only care about ZF then that makes no difference, but if you have a use for the high bit in SF for a later branch or cmovcc/setcc then you can avoid doing a 2nd test.


Other ways to test a byte in memory:

If you're consuming the flag result with setcc or cmovcc, not a jcc branch, then macro-fusion doesn't matter in the discussion below.

If you also need the actual value in a register later, movzx/test/jcc is almost certainly best. Otherwise you can consider a memory-destination compare.

cmp [mem], immediate can micro-fuse into a load+cmp uop on Intel, as long as the addressing mode is not RIP-relative. (On Sandybridge-family, indexed addressing modes will un-laminate even on Haswell and later: See Micro fusion and addressing modes). Agner Fog doesn't mention whether AMD has this limitation for fusing cmp/jcc with a memory operand.

;;; no downside for setcc or cmovcc, only with JCC on Intel
;;; unknown on AMD
    cmp byte [esp+4], 0       ; micro-fuses into load+cmp with this addressing mode
    jnz   ...                 ; breaks macro-fusion on SnB-family

I don't have an AMD CPU to test whether Ryzen or any other AMD still fuses cmp/jcc when the cmp is mem, immediate. Modern AMD CPUs do in general do cmp/jcc and test/jcc fusion. (But not add/sub/and/jcc fusion like SnB-family).

cmp mem,imm / jcc (vs. movzx/test+jcc):

  • smaller code-size in bytes
  • same number of front-end / fused-domain uops (2) on mainstream Intel. This would be 3 front-end uops if micro-fusion of the cmp+load wasn't possible, e.g. with a RIP-relative addressing mode + immediate. Or on Sandybridge-family with an indexed addressing mode, it would unlaminate to 3 uops after decode but before issuing into the back-end.

    Advantage: this is still 2 on Silvermont/Goldmont / KNL or very old CPUs without macro-fusion. The main advantage of movzx/test/jcc over this is macro-fusion, so it falls behind on CPUs where that doesn't happen.

  • 3 back-end uops (unfused domain = execution ports and space in the scheduler aka RS) because cmp-immediate can't macro-fuse with a JCC on Intel Sandybridge-family CPUs (tested on Skylake). The uops are load, cmp, and a separate branch uop. (vs. 2 for movzx / test+jcc). Back-end uops usually aren't a bottleneck directly, but if the load isn't ready for a while it takes up more space in the RS, limiting how much further past this out-of-order execution can see.

cmp [mem], reg / jcc can macro + micro-fuse into a single compare+branch uop so it's excellent. If you need a zeroed register for anything later in your function, do xor-zero it first and use it for a single-uop compare+branch on memory.

    movzx  eax, [esp+4]       ; 1 uop (load-port only on Intel and Ryzen)
    test   al,al                     ; fuses with jcc
    jnz    ...                ; 1 uop

This is still 2 uops for the front-end but only 2 for the back-end as well. The test/jcc macro-fuse together. It costs more code-size, though.

If you aren't branching but instead using the FLAGS result for cmovcc or setcc, using cmp mem, imm has no downside. It can micro-fuse as long as you don't use a RIP-relative addressing mode (which always blocks micro-fusion when there's also an immediate), or an indexed addressing mode.

like image 74
Peter Cordes Avatar answered Oct 08 '22 21:10

Peter Cordes