The following code compiling with gcc or clang on amd64
// gcc -O2 file.c -c
int f(int a, int b, int c, int d)
{
return a & b & c & d;
}
produces following assembly:
0000000000000000 <f>:
0: 89 d0 mov %edx,%eax
2: 21 c8 and %ecx,%eax
4: 21 f0 and %esi,%eax
6: 21 f8 and %edi,%eax
8: c3 retq
As bitwise and
should be associative one would assume it would be more efficient to accumulate pairwise into two register and then and
those two registers. This would break a dependency and allow parallel execution on a CPU with more than one ALU.
As the compiler does and
into the same register for all operations I am assuming that it relies on the cpu being able to do register renaming to break the dependency itself.
Does the register renaming feature of a CPU have no cost and is always available on amd64 or why do compilers compile the code like this?
Update:
I have found that gcc can perform the expected dependency chain breaking if one passes it a higher value for the tree-assoc-width:
--param tree-reassoc-width=2
This looks like the compiler just being insufficiently clever. Although Intel's Ivy Bridge and Haswell microarchitectures support move elimination, so mov %edx,%eax; and %ecx, %eax
would effectively become and %ecx, %edx -->%eax
, this sequence would still take three cycles (ignoring the fact that such a small sequential dependency chain would be hidden by a modest out-of-order execution window). If the compiler were clever, something more like the following might have been generated:
and %esi,%edi
and %edx,%ecx
mov %edi,%eax
and %ecx,%eax
retq
As you noted, this would break the dependence chain. (With move elimination, the last three instruction have no data dependencies, so if the function call was an instruction [and L2 and L3 miss] and the previous instructions committed while the front-end waited for the instruction cache miss to be handled and a zero-overhead timer was read after the return instruction committed [assuming no target misprediction on the return] might take one cycle less than the code generated by gcc.) A two-wide in-order processor would execute and %esi,%edi; and %edx,%ecx
in one cycle, move %edi,%eax
in the next, and and %ecx,%eax; retq
in the third, whereas for the gcc-generated code mov %edx,%eax
would be executed in the first cycle, and %ecx,%eax
in the second, and %esi,%eax
in the third, and and %edi,%eax; retq
in the fourth.
Register renaming does not break true data dependency chains but removes name dependencies (Write-After-Read [where the write is supposed to occur after the read, so the read gets the old value] and Write-After-Write hazards are name dependencies [technically, a write without reads could be dropped, but detecting that no reads are made and that the later write is not speculative is generally not considered worthwhile]; Read-After-Write is a true data dependence and Read-After-Read has no dependence). On an implementation with out-of-order execution, register renaming is part of the ordinary operation; in that sense it could be considered to have "no cost".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With