I came across this question, which compared the performance of various compilers on computing fibonaci numbers the naive way.
I tried doing this with Haskell to see how it compares to C.
C code:
#include <stdio.h>
#include <stdlib.h>
int fib (int n) {
if (n < 2) return 1;
return fib (n-1) + fib (n-2);
}
int main (int argc, char* argv[]) {
printf ("%i\n", fib (atoi(argv[1])));
return 0;
}
Result:
> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real 0m0.421s
user 0m0.420s
sys 0m0.000s
Haskell:
module Main where
import System.Environment (getArgs)
fib :: Int -> Int
fib n | n < 2 = 1
| otherwise = fib (n-1) + fib (n-2)
main = getArgs >>= print . fib . read . head
Result:
> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real 0m1.476s
user 0m1.476s
sys 0m0.000s
Profiling with
> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof
reveals that fib
takes 100% time and alloc, no surprise. I took some profile of the heap, but don't know what they imply:
> ./fib 40 +RTS -hc
> ./fib 40 +RTS -hd
So my question: Is there anything I can do from my side to make this Haskell program's performance closer to C, or this is just the way GHC does things that happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)
Thank you very much.
[EDIT]
It turned out that ghc -O3
was faster than ghc -O3 -fllvm -optlo-O3
in this case. But optlo-block-placement
made an observable difference for the LLVM backend:
> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.283s
user 0m1.284s
sys 0m0.000s
> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.449s
user 0m1.448s
sys 0m0.000s
> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.112s
user 0m1.096s
sys 0m0.016s
The reason I wanted to investigate this was because both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I already did everything I could :D
> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real 0m0.668s
user 0m0.660s
sys 0m0.008s
The heap profile is not very interesting here, as GHC compiles fib
into a function that operates soleley on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main
implementation.
As far as GHC's core-level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter though. Let us have a quick dive into the code GHC generates:
_Main_zdwfib_info:
.Lc1RK:
leal -8(%ebp),%eax
cmpl 84(%ebx),%eax
jb .Lc1RM
This is the check for stack space. Probably something C doesn't need, as it can let the operation system handle stack space allocation. Haskell has user level threads, so stack space is managed manually.
cmpl $2,0(%ebp)
jl .Lc1RO
The comparison against 2 from your code.
movl 0(%ebp),%eax
decl %eax
The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.
movl %eax,-8(%ebp)
movl $_s1Qp_info,-4(%ebp)
addl $-8,%ebp
jmp _Main_zdwfib_info
The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.
.Lc1RO:
movl $1,%esi
addl $4,%ebp
jmp *0(%ebp)
The code for the case that the parameter is less than 2. The return value is passed in a register.
Bottom line: Everything seems to be working like it should, it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdowns, not sure whether it can be blamed for the full time difference though.
These seems like a really feeble 'benchmark' as barsoap
says. Suppose I compare the following almost equally naive programs:
module Main where
import System.Environment (getArgs)
fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = (\x y -> x + y + y ) (fib (n-3)) (fib (n-2) )
main = getArgs >>= print . fib . read . head
... and in the other corner ...
#include <stdio.h>
#include <stdlib.h>
int fib (int n) {
if (n < 2) return 1;
if (n < 3) return n;
return (fib (n-3) + fib (n-2)) + fib (n-2);
}
int main (int argc, char* argv[]) {
printf ("%i\n", fib (atoi(argv[1])));
return 0;
}
Then the Glorious ghc
crushes gcc
, not too surprisingly, really:
$ ghc --make -fforce-recomp fib.hs -o fibh
[1 of 1] Compiling Main ( fib.hs, fib.o )
Linking fibh ...
$ time ./fibh 40
165580141
real 0m0.013s
user 0m0.007s
sys 0m0.003s
$ gcc fib.c -o fibc
$ time ./fibc 40
165580141
real 0m1.495s
user 0m1.483s
sys 0m0.003s
now turning on optimizations, ghc
picks up a little more speed:
$ ghc --make -fforce-recomp fib.hs -O3 -o fibhO
$ time ./fibhO 40
165580141
real 0m0.007s
user 0m0.002s
sys 0m0.004s
but gcc
finally gets a clue.
$ gcc fib.c -O3 -o fibcO
$ time ./fibcO 40
165580141
real 0m0.007s
user 0m0.004s
sys 0m0.002s
I think the explanation is the ghc
's wariness of common subexpression elimination: it is dangerous where '(almost) everything is an expression', and it figures the programmer knows how to use a lambda.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With