I am working on a language that compiles with LLVM. Just for fun, I wanted to do some microbenchmarks. In one of them, I run a hundred million sin/cos computations in a loop. In pseudocode, it looks like this:
var x: Double = 0.0
for (i <- 0 to 100 000 000)
    x = sin(x)^2 + cos(x)^2
return x.toInteger
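(For reference, a minimal standalone C equivalent of this benchmark might look as follows; this is my sketch, not the original program.)

/* bench.c -- a sketch of the benchmark above, not the original program.
   Build with: cc -O2 bench.c -o bench -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 0.0;
    for (long i = 0; i < 100000000; i++)   /* 100,000,000 iterations */
        x = sin(x) * sin(x) + cos(x) * cos(x);
    printf("%d\n", (int)x);                /* x.toInteger */
    return 0;
}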
If I'm computing sin/cos using LLVM IR inline assembly in the form:
%sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind
this is faster than using fsin and fcos separately. However, it is slower than calling the llvm.sin.f64 and llvm.cos.f64 intrinsics separately, which compile down to calls to the C math library functions, at least with the target settings I'm using (x86_64 with SSE enabled).
It seems LLVM inserts some conversions between single- and double-precision floating point -- that might be the culprit. Why does it do that? Sorry, I'm a relative newbie at assembly:
.globl main
.align 16, 0x90
.type main,@function
main: # @main
.cfi_startproc
# BB#0: # %loopEntry1
xorps %xmm0, %xmm0
movl $-1, %eax
jmp .LBB44_1
.align 16, 0x90
.LBB44_2: # %then4
# in Loop: Header=BB44_1 Depth=1
movss %xmm0, -4(%rsp)
flds -4(%rsp)
#APP
fsincos
#NO_APP
fstpl -16(%rsp)
fstpl -24(%rsp)
movsd -16(%rsp), %xmm0
mulsd %xmm0, %xmm0
cvtsd2ss %xmm0, %xmm1
movsd -24(%rsp), %xmm0
mulsd %xmm0, %xmm0
cvtsd2ss %xmm0, %xmm0
addss %xmm1, %xmm0
.LBB44_1: # %loop2
# =>This Inner Loop Header: Depth=1
incl %eax
cmpl $99999999, %eax # imm = 0x5F5E0FF
jle .LBB44_2
# BB#3: # %break3
cvttss2si %xmm0, %eax
ret
.Ltmp160:
.size main, .Ltmp160-main
.cfi_endproc
The same test with calls to the LLVM sin/cos intrinsics:
.globl main
.align 16, 0x90
.type main,@function
main: # @main
.cfi_startproc
# BB#0: # %loopEntry1
pushq %rbx
.Ltmp162:
.cfi_def_cfa_offset 16
subq $16, %rsp
.Ltmp163:
.cfi_def_cfa_offset 32
.Ltmp164:
.cfi_offset %rbx, -16
xorps %xmm0, %xmm0
movl $-1, %ebx
jmp .LBB44_1
.align 16, 0x90
.LBB44_2: # %then4
# in Loop: Header=BB44_1 Depth=1
movsd %xmm0, (%rsp) # 8-byte Spill
callq cos
mulsd %xmm0, %xmm0
movsd %xmm0, 8(%rsp) # 8-byte Spill
movsd (%rsp), %xmm0 # 8-byte Reload
callq sin
mulsd %xmm0, %xmm0
addsd 8(%rsp), %xmm0 # 8-byte Folded Reload
.LBB44_1: # %loop2
# =>This Inner Loop Header: Depth=1
incl %ebx
cmpl $99999999, %ebx # imm = 0x5F5E0FF
jle .LBB44_2
# BB#3: # %break3
cvttsd2si %xmm0, %eax
addq $16, %rsp
popq %rbx
ret
.Ltmp165:
.size main, .Ltmp165-main
.cfi_endproc
Can you suggest what the ideal assembly would look like with fsincos? PS: Adding -enable-unsafe-fp-math to llc makes the conversions disappear and switches to doubles (fldl etc.), but the speed remains the same:
.globl main
.align 16, 0x90
.type main,@function
main: # @main
.cfi_startproc
# BB#0: # %loopEntry1
xorps %xmm0, %xmm0
movl $-1, %eax
jmp .LBB44_1
.align 16, 0x90
.LBB44_2: # %then4
# in Loop: Header=BB44_1 Depth=1
movsd %xmm0, -8(%rsp)
fldl -8(%rsp)
#APP
fsincos
#NO_APP
fstpl -24(%rsp)
fstpl -16(%rsp)
movsd -24(%rsp), %xmm1
mulsd %xmm1, %xmm1
movsd -16(%rsp), %xmm0
mulsd %xmm0, %xmm0
addsd %xmm1, %xmm0
.LBB44_1: # %loop2
# =>This Inner Loop Header: Depth=1
incl %eax
cmpl $99999999, %eax # imm = 0x5F5E0FF
jle .LBB44_2
# BB#3: # %break3
cvttsd2si %xmm0, %eax
ret
.Ltmp160:
.size main, .Ltmp160-main
.cfi_endproc
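For reference, the same constraint string as in the IR above, expressed as GCC extended asm in C (my sketch, not code from this build), would be:

/* Sketch: fsincos leaves cos(x) in st(0) and sin(x) in st(1), so
   "=t" (st(0)) receives the cosine and "=u" (st(1)) the sine;
   "0" ties the input x to st(0). */
static void fsincos_wrap(double x, double *s, double *c)
{
    double sv, cv;
    __asm__ ("fsincos" : "=t"(cv), "=u"(sv) : "0"(x));
    *s = sv;
    *c = cv;
}

Even with such a wrapper, on x86-64 the results still have to travel from the x87 stack through memory into the SSE registers used by mulsd/addsd, since there is no direct x87-to-XMM move; a truly ideal fsincos loop would have to stay on the x87 stack for the squaring and addition as well.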
Too many documents claim that x87 instructions like fsin or fsincos are the fastest way to compute trigonometric functions. Those claims are often wrong.

The fastest way depends on your CPU. As CPUs have become faster, old hardware trig instructions like fsin have not kept pace. On some CPUs, a software function, using a polynomial approximation for sine or another trig function, is now faster than the hardware instruction.
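As an illustration of what such a software function looks like, here is a toy polynomial sine in C (my sketch with plain Taylor coefficients; a real libm uses carefully fitted minimax coefficients plus range reduction):

/* Toy polynomial sine: sin x ~ x - x^3/3! + x^5/5! - x^7/7!,
   evaluated in Horner form. Only accurate for small |x|. */
static double poly_sin(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120 + x2 * (-1.0/5040))));
}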
In short, fsincos is too slow.
There is enough evidence that the x86-64 platform has moved away from hardware trig:

- FreeBSD's libm for amd64 does not use fsin. NetBSD and OpenBSD made the opposite choice: their libm for amd64 does use x87 instructions.
- SBCL uses fsin in its x86 backend but not in its x86-64 backend. For x86-64, SBCL compiles code that calls sin() in libm.

I timed hardware and software sine on an AMD Phenom II X2 560 (3.3 GHz) from 2010. I wrote a C program with this loop:
volatile double a, s;
/* ... */
for (i = 0; i < 100000000; i++)
    s = sin(a);
I compiled this program twice, with two different implementations of sin(). The hard sin() uses x87 fsin. The soft sin() uses a polynomial approximation. My C compiler, gcc -O2, did not replace my sin() call with an inline fsin.
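The full program is not shown here; a reconstruction along these lines (details and names are my guesses) would be:

/* race.c -- a reconstruction, not the original source.
   Build the two contestants, for example:
     cc -O2 -o race-soft race.c -lm         (libm's software sin)
     cc -O2 -DHARD -o race-hard race.c -lm  (force x87 fsin)   */
#include <math.h>
#include <stdlib.h>

#ifdef HARD
/* A sin() that executes the x87 instruction directly. */
static double hard_sin(double x)
{
    double r;
    __asm__ ("fsin" : "=t"(r) : "0"(x));
    return r;
}
#define sin hard_sin
#endif

int main(int argc, char **argv)
{
    volatile double a, s;
    long i;

    /* Read the argument at run time so the compiler cannot fold sin(a). */
    a = argc > 1 ? atof(argv[1]) : 0.5;
    for (i = 0; i < 100000000; i++)
        s = sin(a);
    return 0;
}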
Here are results for sin(0.5):
$ time race-hard 0.5
0m3.40s real 0m3.40s user 0m0.00s system
$ time race-soft 0.5
0m1.13s real 0m1.15s user 0m0.00s system
Here soft sin(0.5) is so fast that this CPU would compute both soft sin(0.5) and soft cos(0.5) in less time than a single x87 fsin.
And for sin(123):
$ time race-hard 123
0m3.61s real 0m3.62s user 0m0.00s system
$ time race-soft 123
0m3.08s real 0m3.07s user 0m0.01s system
Soft sin(123) is slower than soft sin(0.5) because 123 is too large for the polynomial, so the function must first subtract some multiple of 2π. If I also wanted cos(123), there is a chance that x87 fsincos would beat soft sin(123) plus soft cos(123) on this CPU from 2010.
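A naive version of that reduction step, just to show the extra work involved (a sketch; real libm code does this with extended-precision arithmetic, because fmod against a rounded π loses accuracy for large arguments):

/* Fold x into [-pi, pi] before evaluating the polynomial. */
#include <math.h>

static double reduce(double x)
{
    double r = fmod(x, 2.0 * M_PI);
    if (r > M_PI)
        r -= 2.0 * M_PI;
    else if (r < -M_PI)
        r += 2.0 * M_PI;
    return r;
}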