I've noticed a significant overhead when calling Haskell functions from C, much larger than the overhead of a native C function call. To distill the issue to its essence, I wrote a program that just initializes the Haskell runtime, runs a loop that calls an empty function 100,000,000 times, and returns.
With the function inlined, the program takes 0.003s. Calling an empty function written in C takes 0.18s. Calling an empty function written in Haskell takes 15.5s. (Strangely, if I compile the empty Haskell file separately before linking, it takes a few more seconds. Sub-question: why is this?)
So it looks like there's about a 100-fold slowdown between calling a C function and calling a Haskell function. What is the reason for this, and is there a way to mitigate this slowdown?
EDIT: I've discovered a version of this test in the NoFib benchmark suite, callback002. There's a nice blog post by Edward Z. Yang mentioning this test in the context of the GHC scheduler. I'm still trying to grok this blog post along with Zeta's very nice answer. I'm not yet convinced that there's not a way to do this faster!
To compile the "slow" Haskell version, run
ghc -no-hs-main -O2 -optc-O3 test.c Test.hs -o test
To compile the "fast" C version, run
ghc -no-hs-main -O2 -optc-O3 test.c test2.c TestDummy.hs -o test
test.c:
#include "HsFFI.h"
extern void __stginit_Test(void);
extern void test();
int main(int argc, char *argv[]) {
    hs_init(&argc, &argv);
    hs_add_root(__stginit_Test);

    int i;
    for (i = 0; i < 100000000; i++) {
        test();
    }

    hs_exit();
    return 0;
}
test2.c:
void test() {
}
Test.hs:
{-# LANGUAGE ForeignFunctionInterface #-}
module Test where
foreign export ccall test :: ()
test :: ()
test = ()
TestDummy.hs:
module Test where
TL;DR: Reason: RTS and STG calls. Solution: don't call trivial Haskell functions from C.
What is the reason for this…?
Disclaimer: I've never called Haskell from C. I'm familiar with C and Haskell, but I rarely intertwine both, unless I'm writing a wrapper. Now that I've lost my credibility, let's start this adventure of benchmarking, disassembling and other nifty horrors.
One easy way to check what's eating all of your resources is to use gprof. We'll change your compile line slightly so that -pg is used both by the compiler and the linker (note: I've renamed test.c to main.c and test2.c to test.c):
$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
main.c Test.hs -o test
$ ./test
$ gprof ./test
This gives us the following (flat) profile:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 16.85      2.15     2.15                             scheduleWaitThread
 11.78      3.65     1.50                             createStrictIOThread
  7.66      4.62     0.98                             createThread
  6.68      5.47     0.85                             allocate
  5.66      6.19     0.72                             traverseWeakPtrList
  5.34      6.87     0.68                             isAlive
  4.12      7.40     0.53                             newBoundTask
  3.06      7.79     0.39                             stg_ap_p_fast
  2.36      8.09     0.30                             stg_ap_v_info
  1.96      8.34     0.25                             stg_ap_0_fast
  1.85      8.57     0.24                             rts_checkSchedStatus
  1.81      8.80     0.23                             stg_PAP_apply
  1.73      9.02     0.22                             rts_apply
  1.73      9.24     0.22                             stg_enter_info
  1.65      9.45     0.21                             stg_stop_thread_info
  1.61      9.66     0.21                             test
  1.49      9.85     0.19                             stg_returnToStackTop
  1.49     10.04     0.19                             move_STACK
  1.49     10.23     0.19                             stg_ap_v_fast
  1.41     10.41     0.18                             rts_lock
  1.18     10.56     0.15                             boundTaskExiting
  1.10     10.70     0.14                             StgRun
  0.98     10.82     0.13                             rts_evalIO
  0.94     10.94     0.12                             stg_upd_frame_info
  0.79     11.04     0.10                             blockedThrowTo
  0.67     11.13     0.09                             StgReturn
  0.63     11.21     0.08                             createIOThread
  0.63     11.29     0.08                             stg_bh_upd_frame_info
  0.63     11.37     0.08                             c5KU_info
  0.55     11.44     0.07                             stg_stk_save_n
  0.51     11.50     0.07                             threadPaused
  0.47     11.56     0.06                             dirty_TSO
  0.47     11.62     0.06                             ghczmprim_GHCziCString_unpackCStringzh_info
  0.47     11.68     0.06                             stopHeapProfTimer
  0.39     11.73     0.05                             stg_threadFinished
  0.39     11.78     0.05                             allocGroup
  0.39     11.83     0.05                             base_GHCziTopHandler_runNonIO1_info
  0.39     11.88     0.05                             stg_catchzh
  0.35     11.93     0.05                             freeMyTask
  0.35     11.97     0.05                             rts_eval_
  0.31     12.01     0.04                             awakenBlockedExceptionQueue
  0.31     12.05     0.04                             stg_ap_2_upd_info
  0.27     12.09     0.04                             s5q4_info
  0.24     12.12     0.03                             markStableTables
  0.24     12.15     0.03                             rts_getSchedStatus
  0.24     12.18     0.03                             s5q3_info
  0.24     12.21     0.03                             scavenge_stack
  0.24     12.24     0.03                             stg_ap_7_upd_info
  0.24     12.27     0.03                             stg_ap_n_fast
  0.24     12.30     0.03                             stg_gc_noregs
  0.20     12.32     0.03                             base_GHCziTopHandler_runIO1_info
  0.20     12.35     0.03                             stat_exit
  0.16     12.37     0.02                             GarbageCollect
  0.16     12.39     0.02                             dirty_STACK
  0.16     12.41     0.02                             freeGcThreads
  0.16     12.43     0.02                             rts_mkString
  0.16     12.45     0.02                             scavenge_capability_mut_lists
  0.16     12.47     0.02                             startProfTimer
  0.16     12.49     0.02                             stg_PAP_info
  0.16     12.51     0.02                             stg_ap_stk_p
  0.16     12.53     0.02                             stg_catch_info
  0.16     12.55     0.02                             stg_killMyself
  0.16     12.57     0.02                             stg_marked_upd_frame_info
  0.12     12.58     0.02                             interruptAllCapabilities
  0.12     12.60     0.02                             scheduleThreadOn
  0.12     12.61     0.02                             waitForReturnCapability
  0.08     12.62     0.01                             exitStorage
  0.08     12.63     0.01                             freeWSDeque
  0.08     12.64     0.01                             gcStableTables
  0.08     12.65     0.01                             resetTerminalSettings
  0.08     12.66     0.01                             resizeNurseriesEach
  0.08     12.67     0.01                             scavenge_loop
  0.08     12.68     0.01                             split_free_block
  0.08     12.69     0.01                             startHeapProfTimer
  0.08     12.70     0.01                             stg_MVAR_TSO_QUEUE_info
  0.08     12.71     0.01                             stg_forceIO_info
  0.08     12.72     0.01                             zero_static_object_list
  0.04     12.73     0.01                             frame_dummy
  0.04     12.73     0.01                             rts_evalLazyIO_
  0.00     12.73     0.00        1     0.00     0.00  stginit_export_Test_zdfstableZZC0ZZCmainZZCTestZZCtest
Woah, that's a bunch of functions getting called. How does this compare to your C version?
$ ghc -no-hs-main -O2 -optc-O3 -optc-pg -optl-pg -fforce-recomp \
main.c TestDummy.hs test.c -o test_c
$ ./test_c
$ gprof ./test_c
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 75.00      0.05     0.05                             test
 25.00      0.06     0.02                             frame_dummy
That's a lot simpler. But why?
Maybe you've wondered why test even showed up in the previous profile. Well, gprof itself adds some overhead, as can be seen with objdump:
$ objdump -D ./test_c | grep -A5 "<test>:"
0000000000405630 <test>:
405630: 55 push %rbp
405631: 48 89 e5 mov %rsp,%rbp
405634: e8 f7 d4 ff ff callq 402b30 <mcount@plt>
405639: 5d pop %rbp
40563a: c3 retq
The call to mcount is added by gcc, so for the next part you'll want to remove the -pg options. Let's check the disassembled test routine in C first:
$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \
main.c TestDummy.hs test.c -o test_c
$ objdump -D ./test_c | grep -A2 "<test>:"
0000000000405510 <test>:
405510: f3 c3 repz retq
The repz retq is actually some optimisation magic, but in this case you can think of it as a (mostly) no-op return.
How does this compare to the Haskell version?
$ ghc -no-hs-main -O2 -optc-O3 -fforce-recomp \
main.c Test.hs -o test_hs
$ objdump -D ./test_hs | grep -A18 "<test>:"
0000000000405520 <test>:
405520: 48 83 ec 18 sub $0x18,%rsp
405524: e8 f7 3a 05 00 callq 459020 <rts_lock>
405529: ba 58 24 6b 00 mov $0x6b2458,%edx
40552e: be 80 28 6b 00 mov $0x6b2880,%esi
405533: 48 89 c7 mov %rax,%rdi
405536: 48 89 04 24 mov %rax,(%rsp)
40553a: e8 51 36 05 00 callq 458b90 <rts_apply>
40553f: 48 8d 54 24 08 lea 0x8(%rsp),%rdx
405544: 48 89 c6 mov %rax,%rsi
405547: 48 89 e7 mov %rsp,%rdi
40554a: e8 01 39 05 00 callq 458e50 <rts_evalIO>
40554f: 48 8b 34 24 mov (%rsp),%rsi
405553: bf 64 57 48 00 mov $0x485764,%edi
405558: e8 23 3a 05 00 callq 458f80 <rts_checkSchedStatus>
40555d: 48 8b 3c 24 mov (%rsp),%rdi
405561: e8 0a 3b 05 00 callq 459070 <rts_unlock>
405566: 48 83 c4 18 add $0x18,%rsp
40556a: c3 retq
40556b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
405570: d8 ce fmul %st(6),%st
This looks rather different. Indeed, the RTS functions seem suspicious. Let's have a look at them:
- rts_checkSchedStatus just checks whether the status is fine, and exits otherwise. The Success path doesn't have much overhead, so this function isn't really the culprit.
- rts_unlock and rts_lock basically claim and release a capability (a virtual CPU). They call newBoundTask and boundTaskExiting, which take some time (see profile above).
- rts_apply calls allocate, one of the most often used functions throughout the program (which isn't really surprising, with Haskell being garbage collected).
- rts_evalIO finally creates the actual thread and waits for its completion. So we can estimate that rts_evalIO alone takes about 27%.

So we found all the functions that are taking all the time. The STG and the RTS take all the credit for the overhead of ~150ns per call.
…and is there a way to mitigate this slowdown?
Well, your test is basically a no-op. You're calling it 100,000,000 times, with a total runtime of 15s. Compared to the C version, that's an overhead of ~149ns per call.
The solution is quite simple: don't use Haskell functions in your C world for trivial tasks. Use the right tool for the right situation. After all, you don't use the GMP library if you need to add two numbers that are guaranteed to be less than 10.
Apart from this paradigmatic solution: no. The assembly shown above is what's created by GHC, and it's not possible to create a variant without RTS calls at the moment.
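That said, if the calls must originate in C, you can at least amortize the overhead instead of eliminating it: export one Haskell function that performs the whole batch of work, so the rts_lock/rts_apply/rts_evalIO/rts_unlock sequence shown in the disassembly runs once instead of once per iteration. Here's a minimal sketch of that idea (the module TestMany, the function testMany and its CInt parameter are my own illustrative names, not part of the code above):

TestMany.hs:
{-# LANGUAGE ForeignFunctionInterface #-}
module TestMany where

import Foreign.C.Types (CInt)

-- A single exported entry point that runs the whole loop on the
-- Haskell side, so the RTS is entered and left only once per batch.
foreign export ccall testMany :: CInt -> IO ()

testMany :: CInt -> IO ()
testMany n = go 0
  where
    go i
      | i >= n    = return ()
      | otherwise = go (i + 1)

The C main would then call testMany(100000000) once instead of looping over test(), paying the ~150ns RTS entry cost a single time. Whether this helps in practice depends, of course, on whether your real workload can be batched like this.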