Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Naive Fibonacci in C vs Haskell

Tags:

haskell

ghc

Please, how to make the evaluation of g (fib) completely strict? (I know that this exponential solution is not optimal. I would like to know how to make that recursion completely strict /if possible/)

Haskell

g :: Int -> Int
g 0 = 0
g 1 = 1
g x = g(x-1) + g(x-2)
main = print $ g 42

So that it runs approximately as fast as the naive C solution:

C

#include <stdio.h>

long f(int x)
{
    if (x == 0) return 0;
    if (x == 1) return 1;
    return f(x-1) + f(x-2);
}

int main(void)
{
    printf("%ld\n", f(42));
    return 0;
}

Note: This fibs-recursion is used only as a supersimple example. I totally know, that there are dozens of better algorithms. But there definitely are recursive algorithms which DON'T HAVE so simple and more effective alternatives.

like image 496
Cartesius00 Avatar asked Nov 28 '22 08:11

Cartesius00


2 Answers

The answer is, GHC makes the evaluation completely strict on its own (when you give it the chance by compiling with optimisations). The original code produces the core

Rec {
Main.$wg [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
Main.$wg =
  \ (ww_s1JE :: GHC.Prim.Int#) ->
    case ww_s1JE of ds_XsI {
      __DEFAULT ->
        case Main.$wg (GHC.Prim.-# ds_XsI 1) of ww1_s1JI { __DEFAULT ->
        case Main.$wg (GHC.Prim.-# ds_XsI 2) of ww2_X1K4 { __DEFAULT ->
        GHC.Prim.+# ww1_s1JI ww2_X1K4
        }
        };
      0 -> 0;
      1 -> 1
    }
end Rec }

which, as you can see if you know GHC's core, is completely strict and uses unboxed raw machine integers.

(Unfortunately, the machine code gcc produces from the C source is just plain faster.)

GHC's strictness analyser is rather good, and in simple cases like here, where there's no polymorphism involved and the function is not too complicated, you can count on it finding that it can unbox all values to produce a worker using unboxed Int#s.

However, in cases like this, there's more to producing fast code than just operating on machine types. The assembly produced by the native code generator, as well as by the LLVM backend is basically a direct translation of the code to assembly, check whether the argument is 0 or 1, and if not call the function twice and add the results. Both produce some entry and exit code I don't understand, and on my box, the native code generator produces the faster code.

For the C code, clang -O3 produces the straightforward assembly with less cruft and using more registers,

.Ltmp8:
    .cfi_offset %r14, -24
    movl        %edi, %ebx
    xorl        %eax, %eax
    testl       %ebx, %ebx
    je          .LBB0_4
# BB#1:
    cmpl        $1, %ebx
    jne         .LBB0_3
# BB#2:
    movl        $1, %eax
    jmp         .LBB0_4
.LBB0_3:
    leal        -1(%rbx), %edi
    callq       recfib
    movq        %rax, %r14
    addl        $-2, %ebx
    movl        %ebx, %edi
    callq       recfib
    addq        %r14, %rax
.LBB0_4:
    popq        %rbx
    popq        %r14
    popq        %rbp
    ret

(which for some reason performs much better on my system today than it did yesterday). A lot of the difference in performance between the code produced from the Haskell source and the C comes from the use of registers in the latter case where indirect addressing is used in the former, the core of the algorithm is the same in both.

gcc, without any optimisations produces essentially the same using some indirect addressing, but less than what GHC produced with either the NCG or the LLVM backend. With -O1, ditto, but with even less indirect addressing. With -O2, you get a transformation so that the assembly doesn't easily map back to the source, and with -O3, gcc produces the fairly amazing

.LFB0:
    .cfi_startproc
    pushq   %r15
    .cfi_def_cfa_offset 16
    .cfi_offset 15, -16
    pushq   %r14
    .cfi_def_cfa_offset 24
    .cfi_offset 14, -24
    pushq   %r13
    .cfi_def_cfa_offset 32
    .cfi_offset 13, -32
    pushq   %r12
    .cfi_def_cfa_offset 40
    .cfi_offset 12, -40
    pushq   %rbp
    .cfi_def_cfa_offset 48
    .cfi_offset 6, -48
    pushq   %rbx
    .cfi_def_cfa_offset 56
    .cfi_offset 3, -56
    subq    $120, %rsp
    .cfi_def_cfa_offset 176
    testl   %edi, %edi
    movl    %edi, 64(%rsp)
    movq    $0, 16(%rsp)
    je      .L2
    cmpl    $1, %edi
    movq    $1, 16(%rsp)
    je      .L2
    movl    %edi, %eax
    movq    $0, 16(%rsp)
    subl    $1, %eax
    movl    %eax, 108(%rsp)
.L3:
    movl    108(%rsp), %eax
    movq    $0, 32(%rsp)
    testl   %eax, %eax
    movl    %eax, 72(%rsp)
    je      .L4
    cmpl    $1, %eax
    movq    $1, 32(%rsp)
    je      .L4
    movl    64(%rsp), %eax
    movq    $0, 32(%rsp)
    subl    $2, %eax
    movl    %eax, 104(%rsp)
.L5:
    movl    104(%rsp), %eax
    movq    $0, 24(%rsp)
    testl   %eax, %eax
    movl    %eax, 76(%rsp)
    je      .L6
    cmpl    $1, %eax
    movq    $1, 24(%rsp)
    je      .L6
    movl    72(%rsp), %eax
    movq    $0, 24(%rsp)
    subl    $2, %eax
    movl    %eax, 92(%rsp)
.L7:
    movl    92(%rsp), %eax
    movq    $0, 40(%rsp)
    testl   %eax, %eax
    movl    %eax, 84(%rsp)
    je      .L8
    cmpl    $1, %eax
    movq    $1, 40(%rsp)
    je      .L8
    movl    76(%rsp), %eax
    movq    $0, 40(%rsp)
    subl    $2, %eax
    movl    %eax, 68(%rsp)
.L9:
    movl    68(%rsp), %eax
    movq    $0, 48(%rsp)
    testl   %eax, %eax
    movl    %eax, 88(%rsp)
    je      .L10
    cmpl    $1, %eax
    movq    $1, 48(%rsp)
    je      .L10
    movl    84(%rsp), %eax
    movq    $0, 48(%rsp)
    subl    $2, %eax
    movl    %eax, 100(%rsp)
.L11:
    movl    100(%rsp), %eax
    movq    $0, 56(%rsp)
    testl   %eax, %eax
    movl    %eax, 96(%rsp)
    je      .L12
    cmpl    $1, %eax
    movq    $1, 56(%rsp)
    je      .L12
    movl    88(%rsp), %eax
    movq    $0, 56(%rsp)
    subl    $2, %eax
    movl    %eax, 80(%rsp)
.L13:
    movl    80(%rsp), %eax
    movq    $0, 8(%rsp)
    testl   %eax, %eax
    movl    %eax, 4(%rsp)
    je      .L14
    cmpl    $1, %eax
    movq    $1, 8(%rsp)
    je      .L14
    movl    96(%rsp), %r15d
    movq    $0, 8(%rsp)
    subl    $2, %r15d
.L15:
    xorl    %r14d, %r14d
    testl   %r15d, %r15d
    movl    %r15d, %r13d
    je      .L16
    cmpl    $1, %r15d
    movb    $1, %r14b
    je      .L16
    movl    4(%rsp), %r12d
    xorb    %r14b, %r14b
    subl    $2, %r12d
    .p2align 4,,10
    .p2align 3
.L17:
    xorl    %ebp, %ebp
    testl   %r12d, %r12d
    movl    %r12d, %ebx
    je      .L18
    cmpl    $1, %r12d
    movb    $1, %bpl
    je      .L18
    xorb    %bpl, %bpl
    jmp     .L20
    .p2align 4,,10
    .p2align 3
.L21:
    cmpl    $1, %ebx
    je      .L58
.L20:
    leal    -1(%rbx), %edi
    call    recfib
    addq    %rax, %rbp
    subl    $2, %ebx
    jne     .L21
.L18:
    addq    %rbp, %r14
    subl    $2, %r13d
    je      .L16
    subl    $2, %r12d
    cmpl    $1, %r13d
    jne     .L17
    addq    $1, %r14
.L16:
    addq    %r14, 8(%rsp)
    subl    $2, 4(%rsp)
    je      .L14
    subl    $2, %r15d
    cmpl    $1, 4(%rsp)
    jne     .L15
    addq    $1, 8(%rsp)
.L14:
    movq    8(%rsp), %rax
    addq    %rax, 56(%rsp)
    subl    $2, 96(%rsp)
    je      .L12
    subl    $2, 80(%rsp)
    cmpl    $1, 96(%rsp)
    jne     .L13
    addq    $1, 56(%rsp)
.L12:
    movq    56(%rsp), %rax
    addq    %rax, 48(%rsp)
    subl    $2, 88(%rsp)
    je      .L10
    subl    $2, 100(%rsp)
    cmpl    $1, 88(%rsp)
    jne     .L11
    addq    $1, 48(%rsp)
.L10:
    movq    48(%rsp), %rax
    addq    %rax, 40(%rsp)
    subl    $2, 84(%rsp)
    je      .L8
    subl    $2, 68(%rsp)
    cmpl    $1, 84(%rsp)
    jne     .L9
    addq    $1, 40(%rsp)
.L8:
    movq    40(%rsp), %rax
    addq    %rax, 24(%rsp)
    subl    $2, 76(%rsp)
    je      .L6
    subl    $2, 92(%rsp)
    cmpl    $1, 76(%rsp)
    jne     .L7
    addq    $1, 24(%rsp)
.L6:
    movq    24(%rsp), %rax
    addq    %rax, 32(%rsp)
    subl    $2, 72(%rsp)
    je      .L4
    subl    $2, 104(%rsp)
    cmpl    $1, 72(%rsp)
    jne     .L5
    addq    $1, 32(%rsp)
.L4:
    movq    32(%rsp), %rax
    addq    %rax, 16(%rsp)
    subl    $2, 64(%rsp)
    je      .L2
    subl    $2, 108(%rsp)
    cmpl    $1, 64(%rsp)
    jne     .L3
    addq    $1, 16(%rsp)
.L2:
    movq    16(%rsp), %rax
    addq    $120, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 56
    popq    %rbx
    .cfi_def_cfa_offset 48
    popq    %rbp
    .cfi_def_cfa_offset 40
    popq    %r12
    .cfi_def_cfa_offset 32
    popq    %r13
    .cfi_def_cfa_offset 24
    popq    %r14
    .cfi_def_cfa_offset 16
    popq    %r15
    .cfi_def_cfa_offset 8
    ret
    .p2align 4,,10
    .p2align 3
.L58:
    .cfi_restore_state
    addq    $1, %rbp
    jmp     .L18
    .cfi_endproc

which is much faster than anything else tested. gcc unrolled the algorithm to a remarkable depth, which neither GHC nor LLVM did, and that makes a huge difference here.

like image 144
Daniel Fischer Avatar answered Dec 09 '22 14:12

Daniel Fischer


Start by using a better algorithm!

fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

fib n = fibs !! n-1

fib 42 will give you an answer much faster.

It's much more important to use a better algorithm than make minor speed tweaks.

You can happily and quickly calculate fib 123456 in ghci (i.e. interpreted, not even compiled) with this definition (it's 25801 digits long). You might get your C code to calculate that faster, but you'll take quite a while writing it. This took me hardly any time at all. I spent much more time writing this post!

Morals:

  1. Use the right algorithm!
  2. Haskell lets you write clean versions of code, memoising answers simply.
  3. It's sometimes easier to define an infinite list of answers and grab the one you want than to write some looping version that updates values.
  4. Haskell is awesome.
like image 37
AndrewC Avatar answered Dec 09 '22 15:12

AndrewC