 

Making g++ use SHLD/SHRD instructions

Consider the following code:

#include <limits>
#include <cstdint>

using T = uint32_t; // or uint64_t

T shift(T x, T y, T n)
{
    // Assumes 0 < n < digits: for n == 0 the left-shift count would
    // equal the type width, which is undefined behavior.
    return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
}

According to godbolt, clang 3.8.1 generates the following assembly at -O1, -O2, and -O3:

shift(unsigned int, unsigned int, unsigned int):
        movb    %dl, %cl
        shrdl   %cl, %esi, %edi
        movl    %edi, %eax
        retq

While gcc 6.2 (even with -mtune=haswell) generates:

shift(unsigned int, unsigned int, unsigned int):
    movl    $32, %ecx
    subl    %edx, %ecx
    sall    %cl, %esi
    movl    %edx, %ecx
    shrl    %cl, %edi
    movl    %esi, %eax
    orl     %edi, %eax
    ret

This seems far less optimized, since SHRD is very fast on Intel Sandy Bridge and later. Is there any way to rewrite the function to make it easier for compilers (gcc in particular) to optimize, and to favor the SHLD/SHRD instructions?

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

With -march=haswell, it emits BMI2 shlx / shrx, but still not shrd.
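
For the uint32_t case, one candidate rewrite is to widen into a single 64-bit value, so the whole funnel shift becomes one 64-bit shift (a sketch for illustration, assuming a 64-bit target and 0 <= n < 32; it avoids SHRD rather than producing it):

#include <cstdint>

// Sketch: concatenate y:x into one 64-bit value and shift once.
// Unlike the original function, n == 0 is well defined here.
uint32_t shift64(uint32_t x, uint32_t y, uint32_t n)
{
    uint64_t wide = (uint64_t(y) << 32) | x; // y in the high half, x in the low half
    return uint32_t(wide >> n);              // low 32 bits of the shifted result
}

The packing still costs a couple of instructions, but the shift itself is a single shrq (or shrx with BMI2).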

asked Sep 01 '16 by Vincent


1 Answer

No, I can see no way to get gcc to use the SHRD instruction.
You can, however, influence the code gcc generates with the -mtune and -march options.

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

Yes, you can get gcc to generate BMI2 code.

E.g. x86-64 gcc 6.2 with -O3 -march=znver1 (AMD Zen) generates the following (Haswell timings shown):

    code            critical path latency     reciprocal throughput
    ---------------------------------------------------------------
    mov     eax, 32          *                     0.25
    sub     eax, edx         1                     0.25        
    shlx    eax, esi, eax    1                     0.5
    shrx    esi, edi, edx    *                     0.5
    or      eax, esi         1                     0.25
    ret
    TOTAL:                   3                     1.75

Compared with clang 3.8.1:

    code            critical path latency     reciprocal throughput
    ---------------------------------------------------------------
    mov    cl, dl            1                     0.25
    shrd   edi, esi, cl      4                     2
    mov    eax, edi          *                     0.25
    ret
    TOTAL:                   5                     2.25

Given the dependency chain here, SHRD is slower on Haswell and Skylake and only ties on Sandy Bridge.
The reciprocal throughput is also better for the shrx sequence.

So it depends: on post-BMI2 processors gcc produces the better code; on pre-BMI2 processors clang wins.
SHRD has wildly varying timings on different processors, so I can see why gcc is not overly fond of it.
Even with -Os (optimize for size), gcc still does not select SHRD.

*) Not part of the timing, because it is either not on the critical path or turns into a zero-latency register rename.
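
If you really want SHRD, a minimal sketch of an escape hatch is GNU extended inline asm (gcc/clang on x86 only); this is an illustration, not something gcc will pick on its own:

#include <cstdint>

// Sketch: force SHRD with inline asm. Assumes n < 32; the 32-bit SHRD
// masks the count in CL to 5 bits anyway.
uint32_t shift_shrd(uint32_t x, uint32_t y, uint32_t n)
{
    asm("shrdl %%cl, %1, %0"
        : "+r"(x)   // x is shifted in place and returned
        : "r"(y),   // y supplies the bits shifted in from the top
          "c"(n)    // the count must live in CL
        : "cc");    // SHRD writes the flags
    return x;
}

Keep in mind that the optimizer cannot see through inline asm, so this blocks constant propagation and scheduling around the shift.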

answered Oct 16 '22 by Johan