How can I make the evaluation of g (fib) completely strict? (I know that this exponential solution is not optimal. I would just like to know how to make that recursion completely strict, if that is possible.)
Haskell
g :: Int -> Int
g 0 = 0
g 1 = 1
g x = g (x-1) + g (x-2)
main = print $ g 42
So that it runs approximately as fast as the naive C solution:
C
#include <stdio.h>
long f(int x)
{
if (x == 0) return 0;
if (x == 1) return 1;
return f(x-1) + f(x-2);
}
int main(void)
{
printf("%ld\n", f(42));
return 0;
}
Note: this Fibonacci recursion is used only as a super-simple example. I know there are dozens of better algorithms. But there definitely are recursive algorithms that don't have such simple and more efficient alternatives.
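For reference, this is the kind of manual forcing I have in mind, a sketch using seq on the intermediate results (matching the literals 0 and 1 already forces the argument itself, so this only makes the intent explicit):
Haskell
-- Sketch: seq evaluates both recursive results before the addition.
g :: Int -> Int
g 0 = 0
g 1 = 1
g x = let a = g (x-1)
          b = g (x-2)
      in a `seq` b `seq` a + b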
The answer is that GHC makes the evaluation completely strict on its own (when you give it the chance by compiling with optimisations). The original code produces the core
Rec {
Main.$wg [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
Main.$wg =
\ (ww_s1JE :: GHC.Prim.Int#) ->
case ww_s1JE of ds_XsI {
__DEFAULT ->
case Main.$wg (GHC.Prim.-# ds_XsI 1) of ww1_s1JI { __DEFAULT ->
case Main.$wg (GHC.Prim.-# ds_XsI 2) of ww2_X1K4 { __DEFAULT ->
GHC.Prim.+# ww1_s1JI ww2_X1K4
}
};
0 -> 0;
1 -> 1
}
end Rec }
which, as you can see if you know GHC's core, is completely strict and uses unboxed raw machine integers.
(Unfortunately, the machine code gcc produces from the C source is just plain faster.)
GHC's strictness analyser is rather good, and in simple cases like this one, where no polymorphism is involved and the function is not too complicated, you can count on it finding that it can unbox all values to produce a worker operating on unboxed Int#s.
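If you wanted to write that worker/wrapper split by hand, it would look roughly like this sketch using GHC's unboxed primitives (wg is my name for the worker, mirroring the $wg GHC derived above):
Haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, (+#), (-#))

-- Hand-written analogue of the derived worker: all arithmetic is on
-- unboxed Int#, so nothing here is ever lazy.
wg :: Int# -> Int#
wg 0# = 0#
wg 1# = 1#
wg n  = wg (n -# 1#) +# wg (n -# 2#)

-- The wrapper unboxes the argument and reboxes the result.
g :: Int -> Int
g (I# n) = I# (wg n)

main :: IO ()
main = print (g 42)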
However, in cases like this, there's more to producing fast code than just operating on machine types. The assembly produced by the native code generator, as well as by the LLVM backend, is basically a direct translation of the code: check whether the argument is 0 or 1, and if not, call the function twice and add the results. Both produce some entry and exit code I don't understand, and on my box, the native code generator produces the faster code.
For the C code, clang -O3 produces straightforward assembly with less cruft, using more registers:
.Ltmp8:
.cfi_offset %r14, -24
movl %edi, %ebx
xorl %eax, %eax
testl %ebx, %ebx
je .LBB0_4
# BB#1:
cmpl $1, %ebx
jne .LBB0_3
# BB#2:
movl $1, %eax
jmp .LBB0_4
.LBB0_3:
leal -1(%rbx), %edi
callq recfib
movq %rax, %r14
addl $-2, %ebx
movl %ebx, %edi
callq recfib
addq %r14, %rax
.LBB0_4:
popq %rbx
popq %r14
popq %rbp
ret
(which for some reason performs much better on my system today than it did yesterday). A lot of the difference in performance between the code produced from the Haskell source and from the C comes from the use of registers in the latter, where the former uses indirect addressing; the core of the algorithm is the same in both.
gcc, without any optimisations, produces essentially the same assembly, using some indirect addressing, but less of it than what GHC produced with either the NCG or the LLVM backend. With -O1, ditto, but with even less indirect addressing. With -O2, you get a transformation after which the assembly no longer maps easily back to the source, and with -O3, gcc produces the fairly amazing
.LFB0:
.cfi_startproc
pushq %r15
.cfi_def_cfa_offset 16
.cfi_offset 15, -16
pushq %r14
.cfi_def_cfa_offset 24
.cfi_offset 14, -24
pushq %r13
.cfi_def_cfa_offset 32
.cfi_offset 13, -32
pushq %r12
.cfi_def_cfa_offset 40
.cfi_offset 12, -40
pushq %rbp
.cfi_def_cfa_offset 48
.cfi_offset 6, -48
pushq %rbx
.cfi_def_cfa_offset 56
.cfi_offset 3, -56
subq $120, %rsp
.cfi_def_cfa_offset 176
testl %edi, %edi
movl %edi, 64(%rsp)
movq $0, 16(%rsp)
je .L2
cmpl $1, %edi
movq $1, 16(%rsp)
je .L2
movl %edi, %eax
movq $0, 16(%rsp)
subl $1, %eax
movl %eax, 108(%rsp)
.L3:
movl 108(%rsp), %eax
movq $0, 32(%rsp)
testl %eax, %eax
movl %eax, 72(%rsp)
je .L4
cmpl $1, %eax
movq $1, 32(%rsp)
je .L4
movl 64(%rsp), %eax
movq $0, 32(%rsp)
subl $2, %eax
movl %eax, 104(%rsp)
.L5:
movl 104(%rsp), %eax
movq $0, 24(%rsp)
testl %eax, %eax
movl %eax, 76(%rsp)
je .L6
cmpl $1, %eax
movq $1, 24(%rsp)
je .L6
movl 72(%rsp), %eax
movq $0, 24(%rsp)
subl $2, %eax
movl %eax, 92(%rsp)
.L7:
movl 92(%rsp), %eax
movq $0, 40(%rsp)
testl %eax, %eax
movl %eax, 84(%rsp)
je .L8
cmpl $1, %eax
movq $1, 40(%rsp)
je .L8
movl 76(%rsp), %eax
movq $0, 40(%rsp)
subl $2, %eax
movl %eax, 68(%rsp)
.L9:
movl 68(%rsp), %eax
movq $0, 48(%rsp)
testl %eax, %eax
movl %eax, 88(%rsp)
je .L10
cmpl $1, %eax
movq $1, 48(%rsp)
je .L10
movl 84(%rsp), %eax
movq $0, 48(%rsp)
subl $2, %eax
movl %eax, 100(%rsp)
.L11:
movl 100(%rsp), %eax
movq $0, 56(%rsp)
testl %eax, %eax
movl %eax, 96(%rsp)
je .L12
cmpl $1, %eax
movq $1, 56(%rsp)
je .L12
movl 88(%rsp), %eax
movq $0, 56(%rsp)
subl $2, %eax
movl %eax, 80(%rsp)
.L13:
movl 80(%rsp), %eax
movq $0, 8(%rsp)
testl %eax, %eax
movl %eax, 4(%rsp)
je .L14
cmpl $1, %eax
movq $1, 8(%rsp)
je .L14
movl 96(%rsp), %r15d
movq $0, 8(%rsp)
subl $2, %r15d
.L15:
xorl %r14d, %r14d
testl %r15d, %r15d
movl %r15d, %r13d
je .L16
cmpl $1, %r15d
movb $1, %r14b
je .L16
movl 4(%rsp), %r12d
xorb %r14b, %r14b
subl $2, %r12d
.p2align 4,,10
.p2align 3
.L17:
xorl %ebp, %ebp
testl %r12d, %r12d
movl %r12d, %ebx
je .L18
cmpl $1, %r12d
movb $1, %bpl
je .L18
xorb %bpl, %bpl
jmp .L20
.p2align 4,,10
.p2align 3
.L21:
cmpl $1, %ebx
je .L58
.L20:
leal -1(%rbx), %edi
call recfib
addq %rax, %rbp
subl $2, %ebx
jne .L21
.L18:
addq %rbp, %r14
subl $2, %r13d
je .L16
subl $2, %r12d
cmpl $1, %r13d
jne .L17
addq $1, %r14
.L16:
addq %r14, 8(%rsp)
subl $2, 4(%rsp)
je .L14
subl $2, %r15d
cmpl $1, 4(%rsp)
jne .L15
addq $1, 8(%rsp)
.L14:
movq 8(%rsp), %rax
addq %rax, 56(%rsp)
subl $2, 96(%rsp)
je .L12
subl $2, 80(%rsp)
cmpl $1, 96(%rsp)
jne .L13
addq $1, 56(%rsp)
.L12:
movq 56(%rsp), %rax
addq %rax, 48(%rsp)
subl $2, 88(%rsp)
je .L10
subl $2, 100(%rsp)
cmpl $1, 88(%rsp)
jne .L11
addq $1, 48(%rsp)
.L10:
movq 48(%rsp), %rax
addq %rax, 40(%rsp)
subl $2, 84(%rsp)
je .L8
subl $2, 68(%rsp)
cmpl $1, 84(%rsp)
jne .L9
addq $1, 40(%rsp)
.L8:
movq 40(%rsp), %rax
addq %rax, 24(%rsp)
subl $2, 76(%rsp)
je .L6
subl $2, 92(%rsp)
cmpl $1, 76(%rsp)
jne .L7
addq $1, 24(%rsp)
.L6:
movq 24(%rsp), %rax
addq %rax, 32(%rsp)
subl $2, 72(%rsp)
je .L4
subl $2, 104(%rsp)
cmpl $1, 72(%rsp)
jne .L5
addq $1, 32(%rsp)
.L4:
movq 32(%rsp), %rax
addq %rax, 16(%rsp)
subl $2, 64(%rsp)
je .L2
subl $2, 108(%rsp)
cmpl $1, 64(%rsp)
jne .L3
addq $1, 16(%rsp)
.L2:
movq 16(%rsp), %rax
addq $120, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 56
popq %rbx
.cfi_def_cfa_offset 48
popq %rbp
.cfi_def_cfa_offset 40
popq %r12
.cfi_def_cfa_offset 32
popq %r13
.cfi_def_cfa_offset 24
popq %r14
.cfi_def_cfa_offset 16
popq %r15
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L58:
.cfi_restore_state
addq $1, %rbp
jmp .L18
.cfi_endproc
which is much faster than anything else tested. gcc unrolled the algorithm to a remarkable depth, which neither GHC nor LLVM did, and that makes a huge difference here.
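What gcc did here is roughly equivalent to inlining the function into itself repeatedly; one level of that kind of unrolling, written by hand in Haskell, would look like this sketch:
Haskell
-- Sketch of one level of the unrolling gcc performed: the two
-- recursive calls are expanded inline before recursing further.
g' :: Int -> Int
g' 0 = 0
g' 1 = 1
g' x = a + b
  where
    a = case x - 1 of { 0 -> 0; 1 -> 1; y -> g' (y-1) + g' (y-2) }
    b = case x - 2 of { 0 -> 0; 1 -> 1; y -> g' (y-1) + g' (y-2) }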
Start by using a better algorithm!
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
fib n = fibs !! n
fib 42 will give you an answer much faster.
It's much more important to use a better algorithm than make minor speed tweaks.
You can happily and quickly calculate fib 123456 in ghci (i.e. interpreted, not even compiled) with this definition (the result is 25801 digits long). You might get your C code to calculate that faster, but you'll take quite a while writing it. This took me hardly any time at all. I spent much more time writing this post!
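If you want both the better algorithm and explicit strictness, the standard pattern is a strict accumulator loop; a sketch (fibIter is a name I've made up):
Haskell
{-# LANGUAGE BangPatterns #-}

-- Linear-time Fibonacci with strict accumulators: the bang patterns
-- keep a and b evaluated, so no thunks pile up even for large n.
fibIter :: Int -> Integer
fibIter n = go n 0 1
  where
    go 0 !a _  = a
    go k !a !b = go (k-1) b (a+b)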
Moral: time spent choosing a better algorithm beats time spent on micro-optimisations.