Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cost of x86 register renaming

The following code compiling with gcc or clang on amd64

// gcc -O2 file.c -c
int f(int a, int b, int c, int d)
{
    return a & b & c & d;
}

produces following assembly:

0000000000000000 <f>:
  0:    89 d0                   mov    %edx,%eax
  2:    21 c8                   and    %ecx,%eax
  4:    21 f0                   and    %esi,%eax
  6:    21 f8                   and    %edi,%eax
  8:    c3                      retq  

As bitwise and should be associative one would assume it would be more efficient to accumulate pairwise into two register and then and those two registers. This would break a dependency and allow parallel execution on a CPU with more than one ALU.

As the compiler does and into the same register for all operations I am assuming that it relies on the cpu being able to do register renaming to break the dependency itself.

Does the register renaming feature of a CPU have no cost and is always available on amd64 or why do compilers compile the code like this?

Update:

I have found that gcc can perform the expected dependency chain breaking if one passes it a higher value for the tree-assoc-width:

--param tree-reassoc-width=2
like image 401
jtaylor Avatar asked Mar 27 '14 00:03

jtaylor


1 Answers

This looks like the compiler just being insufficiently clever. Although Intel's Ivy Bridge and Haswell microarchitectures support move elimination, so mov %edx,%eax; and %ecx, %eax would effectively become and %ecx, %edx -->%eax, this sequence would still take three cycles (ignoring the fact that such a small sequential dependency chain would be hidden by a modest out-of-order execution window). If the compiler were clever, something more like the following might have been generated:

and    %esi,%edi
and    %edx,%ecx
mov    %edi,%eax
and    %ecx,%eax
retq  

As you noted, this would break the dependence chain. (With move elimination, the last three instruction have no data dependencies, so if the function call was an instruction [and L2 and L3 miss] and the previous instructions committed while the front-end waited for the instruction cache miss to be handled and a zero-overhead timer was read after the return instruction committed [assuming no target misprediction on the return] might take one cycle less than the code generated by gcc.) A two-wide in-order processor would execute and %esi,%edi; and %edx,%ecx in one cycle, move %edi,%eax in the next, and and %ecx,%eax; retq in the third, whereas for the gcc-generated code mov %edx,%eax would be executed in the first cycle, and %ecx,%eax in the second, and %esi,%eax in the third, and and %edi,%eax; retq in the fourth.

Register renaming does not break true data dependency chains but removes name dependencies (Write-After-Read [where the write is supposed to occur after the read, so the read gets the old value] and Write-After-Write hazards are name dependencies [technically, a write without reads could be dropped, but detecting that no reads are made and that the later write is not speculative is generally not considered worthwhile]; Read-After-Write is a true data dependence and Read-After-Read has no dependence). On an implementation with out-of-order execution, register renaming is part of the ordinary operation; in that sense it could be considered to have "no cost".

like image 97
Paul A. Clayton Avatar answered Sep 20 '22 15:09

Paul A. Clayton