I have a small function in written Haskell with the following type:
foreign export ccall sget :: Ptr CInt -> CSize -> Ptr CSize -> IO (Ptr CInt)
I am calling this from multiple C++ threads running concurrently (via TBB). During this part of the execution of my program I can barely get a load average above 1.4 even though I'm running on a six-core CPU (12 logical cores). I therefore suspect that either the calls into Haskell all get funnelled through a single thread, or there is some significant synchronization going on.
I am not doing any such thing explicitly, all the function does is operate on the incoming data (after storing it into a Data.Vector.Storable), and return the result back as a newly allocated array (from Data.Marshal.Array).
Is there anything I need to do to fully enable concurrent calls like this?
I am using GHC 8.6.5 on Debian Linux (bullseye/testing), and I am compiling with -threaded -O2.
Looking forward to reading some advice,
Sebastian
Using the simple example at the end of this answer, if I compile with:
$ ghc -O2 Worker.hs
$ ghc -O2 -threaded Worker.o caller.c -lpthread -no-hs-main -o test
then running it with ./test
occupies only one core at 100%. I need to run it with ./test +RTS -N
, and then on my 4-core desktop, it runs at 400% with a load average of around 4.0.
So, the RTS -N
flag affects the number of parallel threads that can simultaneously run an exported Haskell function and there is no special action required (other than compiling with -threaded
and running with +RTS -n
) to fully utilize all available cores.
So, there must be something about your example that's causing the problem. It could be contention between threads over some shared data structure. Or, maybe parallel garbage collection is causing problems; I've observed parallel GC causing worse performance with increasing -N
in a simple test case (details forgotten, sadly), so you could try turning off parallel GC with -qg
or limiting the number of cores involved with -qn2
or something. To enable these options, you need to call hs_init_with_rtsopts()
in place of the usual hs_init()
as in my example.
If that doesn't work, I think you'll have to try to narrow down the problem and post a minimal example that illustrates the performance issue to get more help.
My example:
caller.c
#include "HsFFI.h"
#include "Rts.h"
#include "Worker_stub.h"
#include <pthread.h>
#define NUM_THREAD 4
void*
work(void* arg)
{
for (;;) {
fibIO(30);
}
}
int
main(int argc, char **argv)
{
hs_init_with_rtsopts(&argc, &argv);
pthread_t threads[NUM_THREAD];
for (int i = 0; i < NUM_THREAD; ++i) {
int rc = pthread_create(&threads[i], NULL, work, NULL);
}
for (int i = 0; i < NUM_THREAD; ++i) {
pthread_join(threads[i], NULL);
}
hs_exit();
return 0;
}
Worker.hs
module Worker where
import Foreign
fibIO :: Int -> IO Int
fibIO = return . fib
fib :: Int -> Int
fib n | n > 1 = fib (n-1) + fib (n-2)
| otherwise = 1
foreign export ccall fibIO :: Int -> IO Int
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With