
GCC x86-64 Suboptimal Assembly Output, why?

When viewing the assembly output of the following code (compiled without optimizations; -O2 and -O3 produce very similar results):

#include <stdio.h>

int main(int argc, char **argv)
{
    volatile float f1 = 1.0f;
    volatile float f2 = 2.0f;

    if(f1 > f2)
    {
        puts("+");
    }
    else if(f1 < f2)
    {
        puts("-");
    }

    return 0;
}

GCC does something that I have a hard time following:

.LC2:
    .string "+"
.LC3:
    .string "-"
    .text
.globl main
    .type   main, @function
main:
.LFB2:
    pushq   %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    subq    $32, %rsp
.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movl    $0x3f800000, %eax
    movl    %eax, -4(%rbp)
    movl    $0x40000000, %eax
    movl    %eax, -8(%rbp)
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm0, %xmm1
    jbe .L9
.L7:
    movl    $.LC2, %edi
    call    puts
    jmp .L4
.L9:
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm1, %xmm0
    jbe .L4
.L8:
    movl    $.LC3, %edi
    call    puts
.L4:
    movl    $0, %eax
    leave
    ret

Why does GCC move the float values into xmm0 and xmm1 twice and also run ucomiss twice?

Wouldn't it be faster to do the following?

.LC2:
    .string "+"
.LC3:
    .string "-"
    .text
.globl main
    .type   main, @function
main:
.LFB2:
    pushq   %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    subq    $32, %rsp
.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movl    $0x3f800000, %eax
    movl    %eax, -4(%rbp)
    movl    $0x40000000, %eax
    movl    %eax, -8(%rbp)
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm0, %xmm1
    jb  .L8 # jump if less than
    je  .L4 # jump if equal
.L7:
    movl    $.LC2, %edi
    call    puts
    jmp .L4
.L8:
    movl    $.LC3, %edi
    call    puts
.L4:
    movl    $0, %eax
    leave
    ret

I'm not at all a real assembly programmer, but it just seemed odd to me to have duplicate instructions running. Is there a problem with my version of the code?


Update

If you remove the volatile qualifiers I originally had and read the values with scanf() instead, you get the same result:

#include <stdio.h>

int main(int argc, char **argv)
{
    float f1;
    float f2;

    scanf("%f", &f1);
    scanf("%f", &f2);

    if(f1 > f2)
    {
        puts("+");
    }
    else if(f1 < f2)
    {
        puts("-");
    }

    return 0;
}

And the corresponding assembler:

.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    leaq    -4(%rbp), %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    scanf
    leaq    -8(%rbp), %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    scanf
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm0, %xmm1
    jbe .L9
.L7:
    movl    $.LC1, %edi
    call    puts
    jmp .L4
.L9:
    movss   -4(%rbp), %xmm1
    movss   -8(%rbp), %xmm0
    ucomiss %xmm1, %xmm0
    jbe .L4
.L8:
    movl    $.LC2, %edi
    call    puts
.L4:
    movl    $0, %eax
    leave
    ret

Final Update

After reviewing some of the follow-up comments, it seems han (who commented under Jonathan Leffler's post) nailed this problem. GCC skips the optimization not because it can't make it, but because I hadn't told it to. It all comes down to IEEE floating-point rules: to honor the strict semantics, GCC can't simply do a jump-if-above or jump-if-below after the first UCOMISS, because it needs to handle all the special cases of floating-point numbers (such as NaN operands). When using han's recommendation of the -ffast-math option (none of -O1, -O2, or -O3 enables -ffast-math, as it can break some programs), GCC does exactly what I was looking for:

The following assembly was produced by GCC 4.3.2 with "gcc -S -O3 -ffast-math test.c":

.LC0:
    .string "%f"
.LC1:
    .string "+"
.LC2:
    .string "-"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB25:
    subq    $24, %rsp
.LCFI0:
    movl    $.LC0, %edi
    xorl    %eax, %eax
    leaq    20(%rsp), %rsi
    call    scanf
    leaq    16(%rsp), %rsi
    xorl    %eax, %eax
    movl    $.LC0, %edi
    call    scanf
    movss   20(%rsp), %xmm0
    comiss  16(%rsp), %xmm0
    ja  .L11
    jb  .L12
    xorl    %eax, %eax
    addq    $24, %rsp
    .p2align 4,,1
    .p2align 3
    ret
    .p2align 4,,10
    .p2align 3
.L12:
    movl    $.LC2, %edi
    call    puts
    xorl    %eax, %eax
    addq    $24, %rsp
    ret
    .p2align 4,,10
    .p2align 3
.L11:
    movl    $.LC1, %edi
    call    puts
    xorl    %eax, %eax
    addq    $24, %rsp
    ret

Notice that the two UCOMISS instructions are now replaced with one COMISS, directly followed by a JA (jump if above) and a JB (jump if below). GCC makes this optimization if you allow it with -ffast-math!

UCOMISS vs COMISS (http://www.softeng.rl.ac.uk/st/archive/SoftEng/SESP/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc315.htm): "The UCOMISS instruction differs from the COMISS instruction in that it signals an invalid SIMD floating-point exception only when a source operand is an SNaN. The COMISS instruction signals invalid if a source operand is either a QNaN or an SNaN."

Thanks again everyone for the helpful discussion.

Brandon asked Sep 19 '11 01:09


2 Answers

Here's another reason: If you take a close look at it, it's NOT the same expression.

The two conditions, f1 > f2 and f1 < f2, are not complements of each other. Therefore, you have to do two comparisons anyway. In addition, volatile forces the values to be reloaded.

EDIT: (see comments, I forgot you can do that with the flags)

To answer the new question:

Combining those two ucomiss instructions is not a completely obvious optimization from the compiler's perspective.

In order to combine them, the compiler must:

  1. Recognize that ucomiss %xmm0, %xmm1 is the "same" as ucomiss %xmm1, %xmm0.
  2. Then it must do a common sub-expression elimination pass to pull it out.

All of this needs to be done after the compiler does instruction selection. And most of the optimization passes are done before instruction selection.

What worries me more is why f1 and f2 aren't being kept in registers after you got rid of the volatiles. -O3 is really giving you this?

Mysticial answered Oct 19 '22 22:10


The volatile qualifier means that the values of f1 and f2 may change in ways the compiler cannot detect/expect, so it must access the memory every time it uses either f1 or f2. The generated code does that - so it is correct.

Compare and contrast with the code you get if you remove the volatile qualifiers from either variable, or both variables. You might, ultimately, need to read the values of f1 and f2 from somewhere in order to avoid the compiler evaluating the expressions at compile time.


In the updated code, you get two different incantations for the ucomiss instruction, though the preceding movss instructions are the same:

    ucomiss %xmm0, %xmm1
    ucomiss %xmm1, %xmm0

The order of the operands for the ucomiss instruction is reversed for the reversed condition:

if (f1 > f2)
if (f1 < f2)

I'm not convinced the optimizer is optimizing where it could, but the question is morphing beyond my level of expertise.

Jonathan Leffler answered Oct 19 '22 22:10