I've noticed that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. Here is a minimal test case.
#include <stdlib.h>

typedef unsigned char UINT8;

void
stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
{
    int xx, x;
    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}

int
main(int argc, char **argv)
{
    int i;
    int xsize = 2048;
    UINT8 *lineIn = calloc(xsize, sizeof(UINT8));
    UINT8 *lineOut = calloc(xsize, sizeof(UINT8));
    float *kk = calloc(xsize, sizeof(float));

    for (i = 0; i < 1024; i++) {
        stretch(lineOut, lineIn, xsize, kk);
    }
    return 0;
}
And here is how it runs:
$ cc --version
cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
$ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
user 14.166s
$ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
user 5.018s
As you can see, the 32-bit version runs almost 3 times faster (I've tested both on 32-bit and 64-bit Ubuntu; the results are the same). Even stranger, the performance depends on the C standard:
$ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
user 15.825s
$ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
user 5.090s
How can this be? How can I work around it to speed up the 64-bit version generated by GCC?
Update 1
I've compared the assembler produced in the fast 32-bit cases (default and gnu99) with the slow one (c99) and found the following:
.L5:
movzbl (%ebx,%eax), %edx # MEM[base: lineIn_10(D), index: _72, offset: 0B], D.1543
movl %edx, (%esp) # D.1543,
fildl (%esp) #
fmuls (%esi,%eax,4) # MEM[base: kk_18(D), index: _72, step: 4, offset: 0B]
addl $1, %eax #, x
cmpl %ecx, %eax # xsize, x
faddp %st, %st(1) #,
fstps 12(%esp) #
flds 12(%esp) #
jne .L5 #,
There are no fstps and flds instructions in the fast case. So in the slow case GCC stores the value to memory and loads it back on every iteration. I've tried the register float type, but it doesn't help.
Update 2
I've tested with gcc-4.9, and it looks like it generates optimal code for 64-bit. And -ffast-math (suggested by @jch) fixes -m32 -std=c99 for both GCC versions. I'm still looking for a solution for 64-bit on gcc-4.8, because it is currently a more common version than 4.9.
There is a partial dependency stall in the code generated by older versions of GCC.
movzbl (%rsi,%rax), %r8d
cvtsi2ss %r8d, %xmm0 ;; all upper bits in %xmm0 are false dependency
The dependency can be broken with an xorps:
#ifdef __SSE__
static inline float __attribute__((always_inline)) i2f(int v)
{
    float x;
    __asm__("xorps %0, %0; cvtsi2ss %1, %0" : "=x"(x) : "r"(v));
    return x;
}
#else
static inline float __attribute__((always_inline)) i2f(int v)
{
    return (float) v;
}
#endif
void
stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
{
    int xx, x;
    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0;
        for (x = 0; x < xsize; x++) {
            ss += i2f(lineIn[x]) * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}
Results
$ cc -O2 -Wall -m64 ./test.c -o ./test64 && time ./test64
./test64 4.07s user 0.00s system 99% cpu 4.070 total
$ cc -O2 -Wall -m32 ./test.c -o ./test32 && time ./test32
./test32 3.94s user 0.00s system 99% cpu 3.938 total
Here is what I tried: I declared ss as volatile. This prevented the compiler from optimizing it, and I got similar times for the 32-bit and 64-bit versions.
The 64-bit build was slightly slower, but that is expected: 64-bit code is larger, and the µop cache has a finite size. So in general, 64-bit code should be only very slightly slower than 32-bit (by less than 3-4%).
Getting back to the problem: I think that in 32-bit mode the compiler makes more aggressive optimizations on ss.
Update 1:
Looking at the 64-bit code, it generates a CVTSI2SS instruction for the integer-to-float conversion of the input bytes, paired with a CVTTSS2SI for the float-to-integer conversion of the result. This pair has higher latency. The 32-bit code just uses an FMULS instruction, operating directly on floats. We need to look for a compiler option that prevents these conversions.
In 32-bit mode, the compiler is making extra efforts to preserve strict IEEE 754 floating-point semantics. You can avoid this by compiling with -ffast-math:
$ gcc -m32 -O2 -std=c99 test.c && time ./a.out
real 0m13.869s
user 0m13.884s
sys 0m0.000s
$ gcc -m32 -O2 -std=c99 -ffast-math test.c && time ./a.out
real 0m4.477s
user 0m4.480s
sys 0m0.000s
I cannot reproduce your results in 64-bit mode, but I'm pretty confident that -ffast-math will solve your issues. More generally, unless you really need reproducible IEEE 754 rounding behaviour, -ffast-math is what you want.