
64 bit code generated by GCC is 3 times slower than 32 bit

I've noticed that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. This is a minimal test case.

#include <stdlib.h>

typedef unsigned char UINT8;

void
stretch(UINT8 * lineOut, UINT8 * lineIn, int xsize, float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}

int
main( int argc, char** argv )
{
    int i;
    int xsize = 2048;

    UINT8 *lineIn = calloc(xsize, sizeof(UINT8));
    UINT8 *lineOut = calloc(xsize, sizeof(UINT8));
    float *kk = calloc(xsize, sizeof(float));

    for (i = 0; i < 1024; i++) {
        stretch(lineOut, lineIn, xsize, kk);
    }

    return 0;
}

And here is how it runs:

$ cc --version
cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
$ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
user  14.166s
$ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
user  5.018s

As you can see, the 32-bit version runs almost 3 times faster (I've tested on both 32-bit and 64-bit Ubuntu; the result is the same). Even stranger, the performance depends on the C standard:

$ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
user  15.825s
$ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
user  5.090s

How can this be? How can I work around it to speed up the 64-bit version generated by GCC?

Update 1

I've compared the assembly produced by the fast 32-bit builds (default and gnu99) with the slow one (c99) and found the following in the slow build:

.L5:
  movzbl    (%ebx,%eax), %edx   # MEM[base: lineIn_10(D), index: _72, offset: 0B], D.1543
  movl  %edx, (%esp)    # D.1543,
  fildl (%esp)  #
  fmuls (%esi,%eax,4)   # MEM[base: kk_18(D), index: _72, step: 4, offset: 0B]
  addl  $1, %eax    #, x
  cmpl  %ecx, %eax  # xsize, x
  faddp %st, %st(1) #,
  fstps 12(%esp)    #
  flds  12(%esp)    #
  jne   .L5 #,

There are no fstps and flds instructions in the fast case. So in the slow case GCC stores the value to memory and loads it back on each iteration. I've tried declaring ss as register float, but this doesn't help.
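The fstps/flds pair is a store and reload of ss, which rounds the x87 register value to single precision. One way to check which evaluation mode a given build uses is the C99 macro FLT_EVAL_METHOD from <float.h>. This is only a sketch: the helper names float_eval_method and has_excess_precision are made up for illustration, and the values in the comment are what GCC typically reports, not something measured here.

```c
#include <float.h>

/* Hypothetical helper: returns the C99 evaluation-method macro.
 *    0 -> float expressions are evaluated in float
 *         (typical for SSE math, e.g. -m64)
 *    2 -> float expressions are evaluated in long double
 *         (typical for x87 math, e.g. -m32)
 *   -1 -> indeterminate */
int float_eval_method(void)
{
    return (int) FLT_EVAL_METHOD;
}

/* Hypothetical helper: non-zero when intermediates may carry
 * more precision than float. */
int has_excess_precision(void)
{
    return float_eval_method() != 0;
}
```

Printing float_eval_method() from each of the builds above may help tell whether the x87 path, and with it the extra rounding, is in play.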

Update 2

I've tested with gcc-4.9, and it looks like it generates optimal code for 64 bit. Also, -ffast-math (suggested by @jch) fixes -m32 -std=c99 for both GCC versions. I'm still looking for a solution for 64 bit on gcc-4.8, because for now it is a more common version than 4.9.

homm asked Oct 27 '14 10:10


3 Answers

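There is a partial register dependency stall in the code generated by older versions of GCC: cvtsi2ss writes only the low 32 bits of its destination, so it has to wait for the previous value of %xmm0.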

movzbl (%rsi,%rax), %r8d
cvtsi2ss %r8d, %xmm0  ;; all upper bits in %xmm0 are false dependency

The dependency can be broken by zeroing the register with xorps before the conversion.

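#ifdef __SSE__
/* Zero the destination first so cvtsi2ss does not depend on its old contents. */
static inline float __attribute__((always_inline)) i2f(int v) {
    float x;
    __asm__("xorps %0, %0; cvtsi2ss %1, %0" : "=x"(x) : "r"(v) );
    return x;
}
#else
/* No SSE: fall back to a plain conversion. */
static inline float __attribute__((always_inline)) i2f(int v) { return (float) v; }
#endif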

void stretch(UINT8* lineOut, UINT8* lineIn, int xsize, float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0;
        for (x = 0; x < xsize; x++) {
            ss += i2f(lineIn[x]) * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}

Results

$ cc -O2 -Wall -m64 ./test.c -o ./test64 && time ./test64
./test64  4.07s user 0.00s system 99% cpu 4.070 total
$ cc -O2 -Wall -m32 ./test.c -o ./test32 && time ./test32
./test32  3.94s user 0.00s system 99% cpu 3.938 total
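The same idea can probably be expressed without inline asm by using SSE intrinsics. The sketch below is an assumption, not what was benchmarked above: the name i2f_intrin is made up, and whether the compiler actually emits an xorps before cvtsi2ss is up to the compiler, so the generated assembly should be checked.

```c
#ifdef __SSE__
#include <xmmintrin.h>

/* Convert into an explicitly zeroed register so the result does not
 * depend on whatever was previously in the destination. */
static inline float i2f_intrin(int v)
{
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), v));
}
#else
/* No SSE: fall back to a plain conversion. */
static inline float i2f_intrin(int v) { return (float) v; }
#endif
```

It would be used the same way as i2f, i.e. ss += i2f_intrin(lineIn[x]) * kk[x]; in the inner loop.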
Vyacheslav Egorov answered Sep 25 '22 02:09


Here is what I tried: I declared ss as volatile. This prevented the compiler from optimizing accesses to it. With that change I got similar times for both the 32-bit and 64-bit versions.

The 64-bit build was slightly slower, but this is normal because 64-bit code is larger and the uCode cache has a finite size. So in general 64-bit code should be only very slightly slower than 32-bit code (less than 3-4%).

Getting back to the problem, I think that in 32bit mode the compiler makes more aggressive optimizations on ss.
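For reference, the volatile experiment described above would look roughly like this. It is a sketch of the change, with the function renamed to stretch_volatile for illustration and const added to the input pointers. Note that volatile forces a real load and store of ss on every iteration, so it most likely evens out the two builds by slowing both down rather than by speeding up the 64-bit one.

```c
typedef unsigned char UINT8;

/* Variant of stretch() with a volatile accumulator.
 * Every update of ss becomes an actual memory load and store. */
void stretch_volatile(UINT8 *lineOut, const UINT8 *lineIn, int xsize,
                      const float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        volatile float ss = 0.0f;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}
```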

Update 1:

Looking at the 64-bit code, it contains a CVTTSS2SI instruction paired with a CVTSI2SS instruction for the conversions between float and integer. These have a higher latency. The 32-bit code just uses an FMULS instruction, operating directly on floats. A compiler option that prevents these conversions would be needed.

VAndrei answered Sep 22 '22 02:09


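In 32-bit mode, the compiler is making extra efforts to preserve strict IEEE 754 floating point semantics. With -std=c99, GCC defaults to -fexcess-precision=standard, so on the x87 FPU it rounds ss to float at every assignment by storing and reloading it. You can avoid this by compiling with -ffast-math: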

$ gcc -m32 -O2 -std=c99 test.c && time ./a.out 

real    0m13.869s
user    0m13.884s
sys     0m0.000s
$ gcc -m32 -O2 -std=c99 -ffast-math test.c && time ./a.out 

real    0m4.477s
user    0m4.480s
sys     0m0.000s

I cannot reproduce your results in 64-bit mode, but I'm pretty confident that -ffast-math will solve your issues. More generally, unless you really need reproducible IEEE 754 rounding behaviour, -ffast-math is what you want.
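To make the trade-off concrete: among other things, -ffast-math allows the compiler to reassociate floating-point operations, and float addition is not associative. The sketch below shows the effect by hand. The function names are made up for illustration, and volatile is used only to force each intermediate result to be rounded to float.

```c
/* (a + b) + c, with the intermediate rounded to float. */
float sum_left_to_right(float a, float b, float c)
{
    volatile float t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate rounded to float. */
float sum_right_to_left(float a, float b, float c)
{
    volatile float t = b + c;
    return a + t;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, the first grouping gives 1.0f and the second gives 0.0f, because -1e8f + 1.0f rounds back to -1e8f in single precision. If your results must not change in this way, -ffast-math is not safe for that code.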

jch answered Sep 25 '22 02:09