Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance considerations of Haskell FFI / C?

If using Haskell as a library being called from my C program, what is the performance impact of making calls in to it? For instance if I have a problem world data set of say 20kB of data, and I want to run something like:

// Go through my 1000 actors and have them make a decision based on
// HaskellCode() function, which is compiled Haskell I'm accessing through
// the FFI.  As an argument, send in the SAME 20kB of data to EACH of these
// function calls, and some actor specific data
// The 20kB constant data defines the environment and the actor specific
// data could be their personality or state
for(i = 0; i < 1000; i++)
   actor[i].decision = HaskellCode(20kB of data here, actor[i].personality);

What's going to happen here - is it going to be possible for me to keep that 20kB of data as a global immutable reference somewhere that is accessed by the Haskell code, or must I create a copy of that data each time through?

The concern is that this data could be larger, much larger - I also hope to write algorithms that act on much larger sets of data, using the same pattern of immutable data being used by several calls of the Haskell code.

Also, I'd like to parallelize this, like a dispatch_apply() GCD or Parallel.ForEach(..) C#. My rationale for parallelization outside of Haskell is that I know I will always be operating on many separate function calls i.e. 1000 actors, so using fine-grained parallelization inside Haskell function is no better than managing it at the C level. Is running FFI Haskell instances 'Thread Safe' and how do I achieve this - do I need to initialize a Haskell instance every time I kick off a parallel run? (Seems slow if I must..) How do I achieve this with good performance?

like image 506
Nektarios Avatar asked Apr 14 '11 14:04

Nektarios


3 Answers

what is the performance impact of making calls in to it

Assuming you start the Haskell runtime up only once (like this), on my machine, making a function call from C into Haskell, passing an Int back and forth across the boundary, takes about 80,000 cycles (31,000 ns on my Core 2) -- determined experimentally via the rdstc register

is it going to be possible for me to keep that 20kB of data as a global immutable reference somewhere that is accessed by the Haskell code

Yes, that is certainly possible. If the data really is immutable, then you get the same result whether you:

  • thread the data back and forth across the language boundary by marshalling;
  • pass a reference to the data back and forth;
  • or cache it in an IORef on the Haskell side.

Which strategy is best? It depends on the data type. The most idiomatic way would be to pass a reference to the C data back and forth, treating it as a ByteString or Vector on the Haskell side.

I'd like to parallelize this

I'd strongly recommend inverting the control then, and doing the parallelization from the Haskell runtime -- it'll be much more robust, as that path has been heavily tested.

Regarding thread safety, it is apparently safe to make parallel calls to foreign exported functions running in the same runtime -- though fairly sure no one has tried this in order to gain parallelism. Calls in acquire a capability, which is essentially a lock, so multiple calls may block, reducing your chances for parallelism. In the multicore case (e.g. -N4 or so) your results may be different (multiple capabilities are available), however, this is almost certainly a bad way to improve performance.

Again, making many parallel functions calls from Haskell via forkIO is a better documented, better tested path, with less overhead than doing the work on the C side, and probably less code in the end.

Just make a call into your Haskell function, that in turn will do the parallelism via many Haskell threads. Easy!

like image 97
Don Stewart Avatar answered Oct 14 '22 13:10

Don Stewart


I use a mix of C and Haskell threads for one of my applications and haven't noticed that much of a performance hit switching between the two. So I crafted a simple benchmark... which is quite a bit faster/cheaper than Don's. This is measuring 10 million iterations on a 2.66GHz i7:

$ ./foo
IO  : 2381952795 nanoseconds total, 238.195279 nanoseconds per, 160000000 value
Pure: 2188546976 nanoseconds total, 218.854698 nanoseconds per, 160000000 value

Compiled with GHC 7.0.3/x86_64 and gcc-4.2.1 on OSX 10.6

ghc -no-hs-main -lstdc++ -O2 -optc-O2 -o foo ForeignExportCost.hs Driver.cpp

Haskell:

{-# LANGUAGE ForeignFunctionInterface #-}

module ForeignExportCost where

import Foreign.C.Types

foreign export ccall simpleFunction :: CInt -> CInt
simpleFunction i = i * i

foreign export ccall simpleFunctionIO :: CInt -> IO CInt
simpleFunctionIO i = return (i * i)

And an OSX C++ app to drive it, should be simple to adjust to Windows or Linux:

#include <stdio.h>
#include <mach/mach_time.h>
#include <mach/kern_return.h>
#include <HsFFI.h>
#include "ForeignExportCost_stub.h"

static const int s_loop = 10000000;

int main(int argc, char** argv) {
    hs_init(&argc, &argv);

    struct mach_timebase_info timebase_info = { };
    kern_return_t err;
    err = mach_timebase_info(&timebase_info);
    if (err != KERN_SUCCESS) {
        fprintf(stderr, "error: %x\n", err);
        return err;
    }

    // timing a function in IO
    uint64_t start = mach_absolute_time();
    HsInt32 val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunctionIO(4);
    }

    // in nanoseconds per http://developer.apple.com/library/mac/#qa/qa1398/_index.html
    uint64_t duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    double duration_per = static_cast<double>(duration) / s_loop;
    printf("IO  : %lld nanoseconds total, %f nanoseconds per, %d value\n", duration, duration_per, val);

    // run the loop again with a pure function
    start = mach_absolute_time();
    val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunction(4);
    }

    duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    duration_per = static_cast<double>(duration) / s_loop;
    printf("Pure: %lld nanoseconds total, %f nanoseconds per, %d value\n", duration, duration_per, val);

    hs_exit();
}
like image 9
Nathan Howell Avatar answered Oct 14 '22 13:10

Nathan Howell


Haskell can peek into that 20k blob if you pass the pointer.

like image 3
Volker Stolz Avatar answered Oct 14 '22 13:10

Volker Stolz