Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calling into Haskell from multiple C/C++ threads

I have a small function in written Haskell with the following type:

foreign export ccall sget :: Ptr CInt -> CSize -> Ptr CSize -> IO (Ptr CInt)

I am calling this from multiple C++ threads running concurrently (via TBB). During this part of the execution of my program I can barely get a load average above 1.4 even though I'm running on a six-core CPU (12 logical cores). I therefore suspect that either the calls into Haskell all get funnelled through a single thread, or there is some significant synchronization going on.

I am not doing any such thing explicitly, all the function does is operate on the incoming data (after storing it into a Data.Vector.Storable), and return the result back as a newly allocated array (from Data.Marshal.Array).

Is there anything I need to do to fully enable concurrent calls like this?

I am using GHC 8.6.5 on Debian Linux (bullseye/testing), and I am compiling with -threaded -O2.

Looking forward to reading some advice,

Sebastian

like image 780
sekoenig Avatar asked Mar 16 '20 19:03

sekoenig


1 Answers

Using the simple example at the end of this answer, if I compile with:

$ ghc -O2 Worker.hs
$ ghc -O2 -threaded Worker.o caller.c -lpthread -no-hs-main -o test

then running it with ./test occupies only one core at 100%. I need to run it with ./test +RTS -N, and then on my 4-core desktop, it runs at 400% with a load average of around 4.0.

So, the RTS -N flag affects the number of parallel threads that can simultaneously run an exported Haskell function and there is no special action required (other than compiling with -threaded and running with +RTS -n) to fully utilize all available cores.

So, there must be something about your example that's causing the problem. It could be contention between threads over some shared data structure. Or, maybe parallel garbage collection is causing problems; I've observed parallel GC causing worse performance with increasing -N in a simple test case (details forgotten, sadly), so you could try turning off parallel GC with -qg or limiting the number of cores involved with -qn2 or something. To enable these options, you need to call hs_init_with_rtsopts() in place of the usual hs_init() as in my example.

If that doesn't work, I think you'll have to try to narrow down the problem and post a minimal example that illustrates the performance issue to get more help.

My example:

caller.c
#include "HsFFI.h"
#include "Rts.h"
#include "Worker_stub.h"
#include <pthread.h>

#define NUM_THREAD 4

void*
work(void* arg)
{
        for (;;) {
                fibIO(30);
        }
}

int
main(int argc, char **argv)
{
        hs_init_with_rtsopts(&argc, &argv);

        pthread_t threads[NUM_THREAD];
        for (int i = 0; i < NUM_THREAD; ++i) {
                int rc = pthread_create(&threads[i], NULL, work, NULL);
        }
        for (int i = 0; i < NUM_THREAD; ++i) {
                pthread_join(threads[i], NULL);
        }

        hs_exit();
        return 0;
}
Worker.hs
module Worker where

import Foreign

fibIO :: Int -> IO Int
fibIO = return . fib

fib :: Int -> Int
fib n | n > 1 = fib (n-1) + fib (n-2)
      | otherwise = 1

foreign export ccall fibIO :: Int -> IO Int
like image 86
K. A. Buhr Avatar answered Oct 16 '22 14:10

K. A. Buhr