
zero assignment versus xor, is the second really faster?

Someone showed me, a few years ago, the following instruction to zero a variable:

xor i,i

He told me that this is faster than just assigning zero to it. Is that true? Do compilers perform this optimization themselves?

asked Oct 08 '11 by stdcall

2 Answers

You can try this yourself to see the answer:

  movl $0,%eax
  xor %eax,%eax

assemble then disassemble:

as xor.s -o xor.o
objdump -D xor.o

And get

   0:   b8 00 00 00 00          mov    $0x0,%eax
   5:   31 c0                   xor    %eax,%eax

The mov instruction for a 32-bit register is 2.5 times larger: it takes longer to load from RAM and consumes that much more cache space. Back in the day the load time alone was a killer. Today the memory cycle time and cache footprint could be argued to be less noticeable, but if your compiler and/or code does this too often, you will see the loss of cache space, more evictions, and more slow system-memory cycles.

In modern CPUs, larger code-size can also slow down the decoders, maybe preventing them from decoding their maximum number of x86 instructions per cycle. (e.g. up to 4 instructions in a 16B block for some CPUs.)

There are also performance advantages to xor over mov in some x86 CPUs (especially Intel's) that have nothing to do with code-size, so xor-zeroing is always preferred in x86 assembly.


Another set of experiments:

void fun1 ( unsigned int *a )
{
    *a=0;
}
unsigned int fun2 ( unsigned int *a, unsigned int *b )
{
    return(*a^*b);
}
unsigned int fun3 ( unsigned int a, unsigned int b )
{
    return(a^b);
}


0000000000000000 <fun1>:
   0:   c7 07 00 00 00 00       movl   $0x0,(%rdi)
   6:   c3                      retq   
   7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   e:   00 00 

0000000000000010 <fun2>:
  10:   8b 06                   mov    (%rsi),%eax
  12:   33 07                   xor    (%rdi),%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    nopw   %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00 

0000000000000020 <fun3>:
  20:   89 f0                   mov    %esi,%eax
  22:   31 f8                   xor    %edi,%eax
  24:   c3                      retq   

This heads down the path of showing what applying xor i,i, as in your question, to variables might lead to. Since you didn't specify what processor or what context you were referring to, it is difficult to paint the whole picture. If, for example, you are talking about C code, you have to understand what compilers do to that code, and that depends heavily on the code in the function itself. If at the time of your xor the compiler has the operand in a register then, depending on your compiler settings, you might get the xor eax,eax; or the compiler can choose to change that to a mov reg,0, or change a something=0; to an xor reg,reg.

Some more sequences to ponder:

if the address to the variable is already in a register:

   7:   c7 07 00 00 00 00       movl   $0x0,(%rdi)

   d:   8b 07                   mov    (%rdi),%eax
   f:   31 c0                   xor    %eax,%eax
  11:   89 07                   mov    %eax,(%rdi)

The compiler will choose the mov of zero instead of the xor sequence, which is what you would get if you tried this C code:

void funx ( unsigned int *a )
{
    *a=*a^*a;
}

The compiler replaces it with a move of zero. The xor version would fetch the same number of bytes, but it needs two memory accesses instead of one, burns a register, and executes three instructions instead of one. So the move of zero is noticeably better.

Now if it is byte sized and in a register:

13: b0 00                   mov    $0x0,%al
15: 30 c0                   xor    %al,%al

no difference in code size. (But they still execute differently).


Now, if you were talking about another processor, let's say ARM:

   0:   e3a00000    mov r0, #0
   4:   e0200000    eor r0, r0, r0
   8:   e3a00000    mov r0, #0
   c:   e5810000    str r0, [r1]
  10:   e5910000    ldr r0, [r1]
  14:   e0200000    eor r0, r0, r0
  18:   e5810000    str r0, [r1]

You don't save anything by using the xor (exclusive or, eor, on ARM): one instruction is one instruction, in both fetch and execution. Xoring something in RAM works like it does on any processor if you have the address of the variable in a register: if you have to copy the data to another register to perform the xor, you still end up with two memory accesses and three instructions. If you have a processor that can do memory-to-memory operations, the move of zero is cheaper, because you only have the one memory access and one or two instructions, depending on the processor.

In fact it's worse than that: eor r0, r0, r0 still has an input dependency on r0 (limiting out-of-order execution), because ARM CPUs do not recognize it as a special zeroing idiom the way modern x86 CPUs do. Xor-zeroing always produces zero, but it only helps performance in x86 assembly.


So the bottom line is: it depends. If you are talking about registers in assembler on an x86 system, anywhere from the 8088 to the present, the xor is often faster because the instruction is smaller: it fetches faster, takes less cache if you have one, and leaves more cache for other code. Likewise, non-x86 variable-instruction-length processors that require the zero to be encoded in the instruction will also need a longer instruction, a longer fetch, and more cache consumed if there is a cache. So the xor is usually faster, depending on how it encodes.

It gets more complicated if you have conditional flags and you want that move/xor to set the zero flag: you may have to pick the one instruction that does (on some processors the mov does not change the flags). Some processors have a special zero register, not general purpose, so that when you use it you get a zero; that way this very common use case can be encoded without burning more instruction space or an extra instruction cycle loading a zero immediate into a register. On the msp430, for example, a move of 0x1234 costs you a two-word instruction, but a move of 0x0000 or 0x0001 and a few other constants can be encoded in a single instruction word.

All processors take a double hit to memory if you are talking about a variable in RAM: a read-modify-write is two memory cycles, not counting the instruction fetches. It gets worse if the read causes a cache-line fill (the write would then be very fast); without the read, the write alone might pass right by the cache and execute quickly, as the processor can keep running while the write goes on in parallel (sometimes you get that performance gain, sometimes not; always if you tune for it).

The x86, and likely older processors, are the reason you see the habit of xoring instead of moving zero. The performance gain is still there today in those specific situations: system memory is still extremely slow, any extra memory cycles are costly, and any cache that is thrown away is costly. Halfway decent compilers, even gcc, will detect an xor i,i as being equivalent to i=0 and choose, on a case-by-case basis, the better instruction sequence for the target.

Get a copy of The Zen of Assembly Language by Michael Abrash. Good used copies are available at a reasonable price (under $50); even if you go for the $80 copies it is well worth it. Try to look beyond the particular 8088 "cycle eaters" and understand the general thought process he is trying to teach. Then spend as much time as you can disassembling your code, ideally for many different processors, and apply what you have learned.

answered Dec 04 '22 by old_timer


On older CPUs (but those after the Pentium Pro, per the comments) this used to be the case; however, most modern CPUs these days have special hot paths for zero assignment (of registers and well-aligned variables) that should yield equivalent performance. Most modern compilers will tend to use a mix of the two, depending on the surrounding code (older MSVC compilers would always use XOR in optimized builds, and it still does use XOR quite a bit, but will also use MOV reg,0 in certain circumstances).

This is very much a micro-optimization, so honestly you can just do whatever suits you best, unless you have tight loops that are lagging due to register dependencies. It should be noted, however, that using XOR takes up less space most of the time, which is great for embedded devices or when you are trying to align a branch target.

This assumes that you are mainly referring to x86 and its derivatives; on that note, @Pascal gave me the idea to put in the technical references that form the basis for this. The Intel Optimization Manual has two sections dealing with this, namely 2.1.3.1 Dependency Breaking Idioms and 3.5.1.7 Clearing Registers and Dependency Breaking Idioms. These two sections basically advocate using XOR-based instructions for any form of register clearing due to their dependency-breaking nature (which removes latency). But in sections where condition codes need preserving, MOVing 0 into a register is preferred.

answered Dec 04 '22 by Necrolis