Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

On improving Haskell's performance compared to C in fibonacci micro-benchmark

I came across this question, which compared the performance of various compilers on computing fibonaci numbers the naive way.

I tried doing this with Haskell to see how it compares to C.

C code:

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Result:

> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real    0m0.421s
user    0m0.420s
sys 0m0.000s

Haskell:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2 = 1
      | otherwise = fib (n-1) + fib (n-2)

main = getArgs >>= print . fib . read . head

Result:

> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real    0m1.476s
user    0m1.476s
sys 0m0.000s

Profiling with

> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof

reveals that fib takes 100% time and alloc, no surprise. I took some profile of the heap, but don't know what they imply:

> ./fib 40 +RTS -hc

enter image description here

> ./fib 40 +RTS -hd

enter image description here

So my question: Is there anything I can do from my side to make this Haskell program's performance closer to C, or this is just the way GHC does things that happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)

Thank you very much.

[EDIT]

It turned out that ghc -O3 was faster than ghc -O3 -fllvm -optlo-O3 in this case. But optlo-block-placement made an observable difference for the LLVM backend:

> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.283s
user    0m1.284s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.449s
user    0m1.448s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.112s
user    0m1.096s
sys 0m0.016s

The reason I wanted to investigate this was because both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I already did everything I could :D

> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real    0m0.668s
user    0m0.660s
sys 0m0.008s
like image 997
Phil Avatar asked Jul 16 '11 08:07

Phil


2 Answers

The heap profile is not very interesting here, as GHC compiles fib into a function that operates soleley on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main implementation.

As far as GHC's core-level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter though. Let us have a quick dive into the code GHC generates:

_Main_zdwfib_info:
.Lc1RK:
    leal -8(%ebp),%eax
    cmpl 84(%ebx),%eax
    jb .Lc1RM

This is the check for stack space. Probably something C doesn't need, as it can let the operation system handle stack space allocation. Haskell has user level threads, so stack space is managed manually.

    cmpl $2,0(%ebp)
    jl .Lc1RO

The comparison against 2 from your code.

    movl 0(%ebp),%eax
    decl %eax

The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.

    movl %eax,-8(%ebp)
    movl $_s1Qp_info,-4(%ebp)
    addl $-8,%ebp
    jmp _Main_zdwfib_info

The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.

.Lc1RO:
    movl $1,%esi
    addl $4,%ebp
    jmp *0(%ebp)

The code for the case that the parameter is less than 2. The return value is passed in a register.

Bottom line: Everything seems to be working like it should, it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdowns, not sure whether it can be blamed for the full time difference though.

like image 85
Peter Wortmann Avatar answered Oct 20 '22 13:10

Peter Wortmann


These seems like a really feeble 'benchmark' as barsoap says. Suppose I compare the following almost equally naive programs:

module Main where
import System.Environment (getArgs)

fib ::  Int ->  Int
fib 0 = 1
fib 1 = 1
fib 2 = 2 
fib n = (\x y -> x + y + y )  (fib (n-3))  (fib (n-2) )

main = getArgs >>= print . fib . read . head  

... and in the other corner ...

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  if (n < 3) return n;
  return (fib (n-3) + fib (n-2)) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Then the Glorious ghc crushes gcc, not too surprisingly, really:

$ ghc --make -fforce-recomp fib.hs -o fibh
[1 of 1] Compiling Main             ( fib.hs, fib.o )
Linking fibh ...
$ time ./fibh 40
165580141

real    0m0.013s
user    0m0.007s
sys 0m0.003s

$ gcc fib.c -o fibc
$ time ./fibc 40
165580141

real    0m1.495s
user    0m1.483s
sys 0m0.003s

now turning on optimizations, ghc picks up a little more speed:

$ ghc --make -fforce-recomp fib.hs -O3 -o fibhO
$ time ./fibhO 40
165580141

real    0m0.007s
user    0m0.002s
sys 0m0.004s

but gcc finally gets a clue.

$ gcc fib.c -O3 -o fibcO
$ time ./fibcO 40
165580141

real    0m0.007s
user    0m0.004s
sys 0m0.002s

I think the explanation is the ghc's wariness of common subexpression elimination: it is dangerous where '(almost) everything is an expression', and it figures the programmer knows how to use a lambda.

like image 33
applicative Avatar answered Oct 20 '22 15:10

applicative