Consider the following code:
#include <limits>
#include <cstdint>
using T = uint32_t; // or uint64_t
T shift(T x, T y, T n)
{
    return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
}
According to Godbolt, clang 3.8.1 generates the following assembly at -O1, -O2, and -O3:
shift(unsigned int, unsigned int, unsigned int):
        movb    %dl, %cl
        shrdl   %cl, %esi, %edi
        movl    %edi, %eax
        retq
While gcc 6.2 (even with -mtune=haswell) generates:
shift(unsigned int, unsigned int, unsigned int):
        movl    $32, %ecx
        subl    %edx, %ecx
        sall    %cl, %esi
        movl    %edx, %ecx
        shrl    %cl, %edi
        movl    %esi, %eax
        orl     %edi, %eax
        ret
This seems far less optimized, since SHRD is very fast on Intel Sandy Bridge and later. Is there any way to rewrite the function to facilitate optimization by compilers (and in particular gcc) and to favor the use of the SHLD/SHRD instructions?
Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?
With -march=haswell, it emits BMI2 shlx / shrx, but still not shrd.
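As a side note, the expression above has undefined behaviour when n == 0, because y is then shifted by the full width of T. A well-defined variant, shown here only as a sketch (with no claim that it changes the generated code), splits the left shift so that every shift count stays below the type width:

// Sketch: same result as shift() for 0 < n < 32, but also well-defined for
// n == 0, where it simply returns x (matching SHRD's behaviour for a zero count).
T shift_safe(T x, T y, T n)
{
    return (x >> n) | ((y << 1) << (std::numeric_limits<T>::digits - 1 - n));
}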
No, I can see no way to get gcc to use the SHRD instruction.
You can manipulate the output that gcc generates by changing the -mtune and -march options.
"Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?"
Yes, you can get gcc to generate BMI2 code. For example, x86-64 gcc 6.2 with -O3 -march=znver1 (AMD Zen) generates the following (timings shown are for Haswell):
code                    critical path latency    reciprocal throughput
-----------------------------------------------------------------------
mov  eax, 32                      *                       0.25
sub  eax, edx                     1                       0.25
shlx eax, esi, eax                1                       0.5
shrx esi, edi, edx                *                       0.5
or   eax, esi                     1                       0.25
ret
-----------------------------------------------------------------------
TOTAL:                            3                       1.75
Compared with clang 3.8.1:
code                    critical path latency    reciprocal throughput
-----------------------------------------------------------------------
mov  cl, dl                       1                       0.25
shrd edi, esi, cl                 4                       2
mov  eax, edi                     *                       0.25
ret
-----------------------------------------------------------------------
TOTAL:                            5                       2.25
Given the dependency chain here, SHRD is slower on Haswell, tied on Sandy Bridge, and slower on Skylake.
The reciprocal throughput is also better for the shrx sequence.
So it depends: on post-BMI2 processors gcc produces better code, while on pre-BMI2 processors clang wins. SHRD has wildly varying timings across processors, so I can see why gcc is not overly fond of it.
Even with -Os (optimize for size), gcc still does not select SHRD.
*) Not part of the timing because the instruction is either not on the critical path or turns into a zero-latency register rename.
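If SHRD is really wanted regardless of what the compiler chooses, the usual escape hatch is inline assembly. The following is only a sketch of that approach (my own illustration, not something gcc will emit for the C++ above), assuming gcc or clang on x86-64:

#include <cstdint>

// Sketch: force SHRD with GNU inline asm (x86-64, GCC/clang only).
// SHRD itself masks the count to 5 bits for 32-bit operands.
uint32_t shift_shrd(uint32_t x, uint32_t y, uint32_t n)
{
    asm("shrdl %%cl, %1, %0"
        : "+r"(x)           // x is shifted in place and holds the result
        : "r"(y), "c"(n)    // y supplies the bits shifted in; the count must be in CL
        : "cc");            // the flags are clobbered
    return x;
}

Of course this ties the code to x86 and bypasses the compiler's own instruction selection and tuning entirely.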