GNU inline assembly optimisation

Question

I am trying to write a small library for highly optimised x86-64 bit operation code and am fiddling with inline asm.

While testing this particular case has caught my attention:

unsigned long test = 0;
unsigned long bsr;

// bit test and set 39th bit
__asm__ ("btsq	%1, %0 " : "+rm" (test) : "rJ" (39) );

// bit scan reverse (get most significant bit id)
__asm__ ("bsrq	%1, %0" : "=r" (bsr) : "rm" (test) );

printf("test = %lu, bsr = %d
", test, bsr);

compiles and runs fine in both gcc and icc, but when I inspect the assembly I get differences

gcc -S -fverbose-asm -std=gnu99 -O3

movq    $0, -8(%rbp)
## InlineAsm Start
btsq    $39, -8(%rbp) 
## InlineAsm End
movq    -8(%rbp), %rax
movq    %rax, -16(%rbp)
## InlineAsm Start
bsrq    -16(%rbp), %rdx
## InlineAsm End
movq    -8(%rbp), %rsi
leaq    L_.str(%rip), %rdi
xorb    %al, %al
callq   _printf

I am wondering why so complicated? I am writing high performance code in which the number of instructions is critical. I am especially wondering why gcc makes a copy of my variable test before passing it to the second inline asm?

Same code compiled with icc gives far better results:

    xorl      %esi, %esi                                    # test = 0
    movl      $.L_2__STRING.0, %edi                         # has something to do with printf
    orl       $32832, (%rsp)                                # part of function initiation
    xorl      %eax, %eax                                    # has something to do with printf
    ldmxcsr   (%rsp)                                        # part of function initiation
    btsq      $39, %rsi                                     #106.0
    bsrq      %rsi, %rdx                                    #109.0
    call      printf                                        #111.2

despite the fact that gcc decides to keep my variables on the stack rather then in registers, what I do not understand is why make a copy of test before passing it to the second asm? If I put test in as an input/output variable in the second asm

__asm__ ("bsrq %1, %0" : "=r" (bsr) , "+rm" (test) );

then those lines disappear.

movq    $0, -8(%rbp)
## InlineAsm Start
btsq    $39, -8(%rbp) 
## InlineAsm End
## InlineAsm Start
bsrq    -8(%rbp), %rdx
## InlineAsm End
movq    -8(%rbp), %rsi
leaq    L_.str(%rip), %rdi
xorb    %al, %al
callq   _printf

Is this gcc screwed up optimisation or am I missing some vital compiler switches? I do have icc for my production system, but if I decide to distribute the source code at some point then it will have to be able to compile with gcc too.

compilers used:

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)

icc Version 12.0.2

FrankH. · Accepted Answer

I've tried your example on Linux like this (making it "evil" by forcing a stack ref/loc for test through using &test in the printf:):

#include <stdio.h>
int main(int argc, char **argv)
{
    unsigned long test = 0;
    unsigned long bsr;
// bit test and set 39th bit
    asm ("btsq	%1, %0 " : "+rm" (test) : "rJ" (39) );
// bit scan reverse (get most significant bit id)
    asm ("bsrq	%1, %0" : "=r" (bsr) : "rm" (test) );
    printf("test = %lu, bsr = %d, &test = %p
", test, bsr, &test);
    return 0;
}

and compiled it with various versions of gcc -O3 ... to the following results:

code generated                                                     gcc version
================================================================================
  400630:       48 83 ec 18             sub    $0x18,%rsp          4.7.2,
  400634:       31 c0                   xor    %eax,%eax           4.6.2,
  400636:       bf 50 07 40 00          mov    $0x400750,%edi      4.4.6
  40063b:       48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  400640:       48 0f ba e8 27          bts    $0x27,%rax
  400645:       48 89 44 24 08          mov    %rax,0x8(%rsp)
  40064a:       48 89 c6                mov    %rax,%rsi
  40064d:       48 0f bd d0             bsr    %rax,%rdx
  400651:       31 c0                   xor    %eax,%eax
  400653:       e8 68 fe ff ff          callq  4004c0 
[ ... ]
---------------------------------------------------------------------------------
  4004f0:       48 83 ec 18             sub    $0x18,%rsp          4.1
  4004f4:       31 c0                   xor    %eax,%eax
  4004f6:       bf 28 06 40 00          mov    $0x400628,%edi
  4004fb:       48 8d 4c 24 10          lea    0x10(%rsp),%rcx
  400500:       48 c7 44 24 10 00 00 00 00      movq   $0x0,0x10(%rsp)
  400509:       48 0f ba e8 27          bts    $0x27,%rax
  40050e:       48 89 44 24 10          mov    %rax,0x10(%rsp)
  400513:       48 89 c6                mov    %rax,%rsi
  400516:       48 0f bd d0             bsr    %rax,%rdx
  40051a:       31 c0                   xor    %eax,%eax
  40051c:       e8 c7 fe ff ff          callq  4003e8 
[ ... ]
---------------------------------------------------------------------------------
  400500:       48 83 ec 08             sub    $0x8,%rsp           3.4.5
  400504:       bf 30 06 40 00          mov    $0x400630,%edi
  400509:       31 c0                   xor    %eax,%eax
  40050b:       48 c7 04 24 00 00 00 00         movq   $0x0,(%rsp)
  400513:       48 89 e1                mov    %rsp,%rcx
  400516:       48 0f ba 2c 24 27       btsq   $0x27,(%rsp)
  40051c:       48 8b 34 24             mov    (%rsp),%rsi
  400520:       48 0f bd 14 24          bsr    (%rsp),%rdx
  400525:       e8 fe fe ff ff          callq  400428 
[ ... ]
---------------------------------------------------------------------------------
  4004e0:       48 83 ec 08             sub    $0x8,%rsp           3.2.3
  4004e4:       bf 10 06 40 00          mov    $0x400610,%edi
  4004e9:       31 c0                   xor    %eax,%eax
  4004eb:       48 c7 04 24 00 00 00 00         movq   $0x0,(%rsp)
  4004f3:       48 0f ba 2c 24 27       btsq   $0x27,(%rsp)
  4004f9:       48 8b 34 24             mov    (%rsp),%rsi
  4004fd:       48 89 e1                mov    %rsp,%rcx
  400500:       48 0f bd 14 24          bsr    (%rsp),%rdx
  400505:       e8 ee fe ff ff          callq  4003f8 
[ ... ]

and while there's a significant difference in the created code (including whether the bsr acceesses test as register or memory), none of the tested revs recreate the assembly that you've shown. I'd suspect a bug in the 4.2.x version you used on MacOSX, but then I don't have either your testcase nor that specific compiler version available.

Edit: The code above is obviously different in the sense that it forces test into the stack; if that is not done, then all "plain" gcc versions I've tested do a direct pair bts $39, %rsi / bsr %rsi, %rdx.

I have found, though, that clang creates different code there:

 140:   50                      push   %rax
 141:   48 c7 04 24 00 00 00 00         movq   $0x0,(%rsp)
 149:   31 f6                   xor    %esi,%esi
 14b:   48 0f ba ee 27          bts    $0x27,%rsi
 150:   48 89 34 24             mov    %rsi,(%rsp)
 154:   48 0f bd d6             bsr    %rsi,%rdx
 158:   bf 00 00 00 00          mov    $0x0,%edi
 15d:   30 c0                   xor    %al,%al
 15f:   e8 00 00 00 00          callq  printf@plt>

so the difference seems to be indeed between the code generators of clang/llvm and "gcc proper".

GNU inline assembly optimisation

Tags:

optimization

gcc

x86-64

icc

inline-assembly

Sergey L.

1 Answers

FrankH.

Recent Activity

Donate For Us

GNU inline assembly optimisation

Tags:

optimization

gcc

x86-64

icc

inline-assembly

Sergey L.

1 Answers

FrankH.

Related questions

Recent Activity

Donate For Us