The most straightforward monadic 'stream' is just a list of monadic actions, Monad m => [m a]. The function sequence :: [m a] -> m [a] evaluates each monadic action and collects the results. As it turns out, though, sequence is not very efficient, because it operates on lists, and the monad is an obstacle to achieving fusion in anything but the simplest cases.
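As a minimal, self-contained illustration (the IORef counter is just a hypothetical stand-in for any monadic source of values): sequence runs every action and builds the complete result list before a consumer such as sum can start.
import Data.IORef (modifyIORef', newIORef, readIORef)

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  let next = modifyIORef' ref (+ 1) >> readIORef ref  -- some monadic action
  -- sequence runs all 1000 actions and materialises the full [Int]
  -- before sum can consume a single element
  xs <- sequence (replicate 1000 next)
  print (sum xs)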
The question is: What is the most efficient approach for monadic streams?
To investigate this, I provide a toy problem along with a few attempts to improve performance. The source code can be found on github. The single benchmark presented below may be misleading for more realistic problems, although I think it is a worst-case scenario of sorts, i.e. roughly the maximum possible overhead per useful computation.
The toy problem
It is a maximum-length 16-bit Linear Feedback Shift Register (LFSR), implemented in C in a somewhat over-elaborate way, with a Haskell wrapper in the IO monad. 'Over-elaborate' refers to the unnecessary use of a struct and its malloc: the purpose of this complication is to make it more similar to realistic situations, where all you have is a Haskell wrapper around an FFI to a C struct with OO-ish new, set, get, operate semantics (i.e. very much the imperative style). A typical Haskell program looks like this:
import LFSR

main = do
  lfsr <- newLFSR          -- make a LFSR object
  setLFSR lfsr 42          -- initialise it with 42
  stepLFSR lfsr            -- do one update
  getLFSR lfsr >>= print   -- extract the new value and print
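The wrapper module itself is not listed here (the real one is src/LFSR.hs in the repo), so the sketch below is only a guess at its shape: the foreign-import entry points other than set_lfsr and step_lfsr, the free_lfsr finalizer, and the ForeignPtr plumbing are assumptions.
-- a sketch only; entry-point names and details beyond set_lfsr/step_lfsr
-- are assumptions (the real module is src/LFSR.hs in the repo)
module LFSR (LFSR, newLFSR, setLFSR, stepLFSR, getLFSR) where

import Data.Word (Word32)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Ptr (FunPtr, Ptr)

data LFSRStruct                              -- opaque stand-in for the C state_t
newtype LFSR = LFSR (ForeignPtr LFSRStruct)

foreign import ccall unsafe "lfsr.h new_lfsr"
  c_new :: IO (Ptr LFSRStruct)
foreign import ccall unsafe "lfsr.h &free_lfsr"
  c_free :: FunPtr (Ptr LFSRStruct -> IO ())
foreign import ccall unsafe "lfsr.h set_lfsr"
  c_set :: Ptr LFSRStruct -> Word32 -> IO ()
foreign import ccall unsafe "lfsr.h step_lfsr"
  c_step :: Ptr LFSRStruct -> IO ()
foreign import ccall unsafe "lfsr.h get_lfsr"
  c_get :: Ptr LFSRStruct -> IO Word32

newLFSR :: IO LFSR                           -- malloc a state_t and wrap it
newLFSR = LFSR <$> (newForeignPtr c_free =<< c_new)

setLFSR :: LFSR -> Word32 -> IO ()
setLFSR (LFSR fp) v = withForeignPtr fp (`c_set` v)

stepLFSR :: LFSR -> IO ()
stepLFSR (LFSR fp) = withForeignPtr fp c_step

getLFSR :: LFSR -> IO Word32
getLFSR (LFSR fp) = withForeignPtr fp c_get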
The default task is to calculate the average of the values (as doubles) over 10'000'000 iterations of the LFSR. This task could be part of a suite of tests to analyse the 'randomness' of this stream of 16-bit integers.
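For readers unfamiliar with LFSRs, one step of a maximal-length 16-bit Fibonacci LFSR can be written in pure Haskell as below. The taps used in the repo's lfsr.c are not shown in this post, so the choice 16, 14, 13, 11 here is purely illustrative.
import Data.Bits (shiftL, shiftR, xor, (.&.), (.|.))
import Data.Word (Word16)

-- one Fibonacci LFSR step with taps 16,14,13,11 (a maximal-length choice);
-- illustrative only, the taps in the repo's lfsr.c may differ
lfsrStep :: Word16 -> Word16
lfsrStep s = (s `shiftR` 1) .|. (b `shiftL` 15)
  where b = (s `xor` (s `shiftR` 2) `xor` (s `shiftR` 3) `xor` (s `shiftR` 5)) .&. 1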
0. Baseline
The baseline is the C implementation of the average over n iterations:
double avg(state_t* s, int n) {
    double sum = 0;
    for(int i=0; i<n; i++, sum += step_get_lfsr(s));
    return sum / (double)n;
}
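For completeness, the benchmark presumably calls this C function through a binding along the following lines; only the C signature above is given, so the Haskell-side names and the opaque LFSRStruct type are assumptions.
import Foreign.C.Types (CDouble, CInt)
import Foreign.Ptr (Ptr)

data LFSRStruct   -- opaque stand-in for the C state_t (assumption)

-- binds double avg(state_t*, int) from above; names are assumptions
foreign import ccall unsafe "lfsr.h avg"
  c_avg :: Ptr LFSRStruct -> CInt -> IO CDouble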
The C implementation isn't meant to be particularly good or fast; it just provides a meaningful computation.
1. Haskell lists
Compared to the C baseline, on this task Haskell lists are 73x slower.
=== RunAvg =========
Baseline: 1.874e-2
IO: 1.382488
factor: 73.77203842049093
This is the implementation (RunAvg.hs):
step1 :: LFSR -> IO Word32
step1 lfsr = stepLFSR lfsr >> getLFSR lfsr

avg :: LFSR -> Int -> IO Double
avg lfsr n = mean <$> replicateM n (step1 lfsr) where
  mean :: [Word32] -> Double
  mean vs = (sum $ fromIntegral <$> vs) / (fromIntegral n)
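For comparison, the intermediate list can be avoided with a strict monadic left fold, reusing step1 from above. This variant is not part of the benchmark suite; it is only a sketch of the alternative.
{-# LANGUAGE BangPatterns #-}

import Control.Monad (foldM)

-- a sketch (not benchmarked here): accumulate strictly in IO instead of
-- building a [Word32] of length n first
avgFold :: LFSR -> Int -> IO Double
avgFold lfsr n = do
  let step !acc _ = do
        v <- step1 lfsr
        return (acc + fromIntegral v)
  total <- foldM step (0 :: Double) [1 .. n]
  return (total / fromIntegral n)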
2. Using the streaming library
This gets us to within 9x of the baseline,
=== RunAvgStreaming ===
Baseline: 1.9391e-2
IO: 0.168126
factor: 8.670310969006241
(Note that the benchmarks are rather inaccurate at these short execution times.)
This is the implementation (RunAvgStreaming.hs):
import qualified Streaming.Prelude as S
import Streaming (Of ((:>)))   -- the strict pair returned by S.sum

avg :: LFSR -> Int -> IO Double
avg lfsr n = do
  let stream = S.replicateM n (fromIntegral <$> step1 lfsr :: IO Double)
  (mySum :> _) <- S.sum stream
  return (mySum / fromIntegral n)
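The mySum :> _ pattern comes from streaming's strict pair type Of, which S.sum uses to return the fold result alongside the rest of the stream's return value. Its definition in the streaming package is roughly:
-- for reference (from the streaming package); the strict left component is
-- the accumulated sum, the right component is the stream's return value
data Of a b = !a :> b
infixr 5 :>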
3. Using Data.Vector.Fusion.Stream.Monadic
This gives the best performance so far, within 3x of baseline,
=== RunVector =========
Baseline: 1.9986e-2
IO: 4.9146e-2
factor: 2.4590213149204443
As usual, here is the implementation (RunAvgVector.hs):
import qualified Data.Vector.Fusion.Stream.Monadic as V

avg :: LFSR -> Int -> IO Double
avg lfsr n = do
  let stream = V.replicateM n (step1' lfsr)
  total <- V.foldl (+) 0.0 stream
  return (total / fromIntegral n)
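step1' is not shown in the post; presumably it is just the Double-returning variant of step1, along the lines of:
-- assumption: the Double-returning variant of step1 used above
step1' :: LFSR -> IO Double
step1' lfsr = fromIntegral <$> step1 lfsr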
I didn't expect to find a good monadic stream implementation under Data.Vector. Other than providing fromVector and concatVectors, Data.Vector.Fusion.Stream.Monadic has very little to do with Vector from Data.Vector.
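The reason this module fuses so well is its representation: a stream is just a seed plus a monadic step function, so a consumer like V.foldl turns into a tight loop with no intermediate structure. Roughly (modulo the exact formulation in the installed vector version):
{-# LANGUAGE ExistentialQuantification #-}

-- roughly the representation in Data.Vector.Fusion.Stream.Monadic
data Step s a = Yield a s | Skip s | Done

data Stream m a = forall s. Stream (s -> m (Step s a)) s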
A look at the profiling report suggests that Data.Vector.Fusion.Stream.Monadic has a considerable space leak, though that doesn't sound right to me.
4. Lists aren't necessarily slow
For very simple operations lists aren't terrible at all:
=== RunRepeat =======
Baseline: 1.8078e-2
IO: 3.6253e-2
factor: 2.0053656377917912
Here, the for loop is done in Haskell instead of pushing it down to C (RunRepeat.hs):
do
  setLFSR lfsr 42
  replicateM_ nIter (stepLFSR lfsr)
  getLFSR lfsr
This is just a repetition of calls to stepLFSR without passing the result back to the Haskell layer. It gives an indication of the impact that the overhead of calling the wrapper and the FFI has.
Analysis
The repeat example above shows that most, but not all (?), of the performance penalty comes from the overhead of calling the wrapper and/or the FFI. But I'm not sure where to look for tweaks now. Maybe this is as good as it gets with regard to monadic streams, and from here on it is all about trimming down the FFI...
Sidenotes
Update 1
An attempt to remove the withForeignPtr calls can be made by introducing a Storable instance and then using alloca :: Storable a => (Ptr a -> IO b) -> IO b:
repeatSteps :: Word32 -> Int -> IO Word32
repeatSteps start n = alloca rep where
  rep :: Ptr LFSRStruct' -> IO Word32
  rep p = do
    setLFSR2 p start
    (sequence_ . replicate n) (stepLFSR2 p)  -- i.e. replicateM_ n (stepLFSR2 p)
    getLFSR2 p
where LFSRStruct' is
data LFSRStruct' = LFSRStruct' CUInt
and the wrapper is
foreign import ccall unsafe "lfsr.h set_lfsr"
  setLFSR2 :: Ptr LFSRStruct' -> Word32 -> IO ()
-- likewise for stepLFSR2, getLFSR2, ...
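alloca additionally needs the Storable instance mentioned above. The one in src/LFSR.hs isn't shown here; it presumably looks something like this, assuming the C struct really does hold just a single unsigned int (as the data declaration suggests):
import Foreign.C.Types (CUInt)
import Foreign.Ptr (castPtr)
import Foreign.Storable (Storable (..))

-- assumption: the struct is a single unsigned int, so the instance
-- simply delegates to CUInt
instance Storable LFSRStruct' where
  sizeOf _    = sizeOf (undefined :: CUInt)
  alignment _ = alignment (undefined :: CUInt)
  peek p      = LFSRStruct' <$> peek (castPtr p)
  poke p (LFSRStruct' v) = poke (castPtr p) v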
See RunRepeatAlloca.hs and src/LFSR.hs. Performance-wise this makes no difference (within timing variance).
=== RunRepeatAlloca =======
Baseline: 0.19811199999999998
IO: 0.33433
factor: 1.6875807623970283
After deciphering GHC's assembly output for RunRepeat.hs, I'm coming to this conclusion: GHC won't inline the call to the C function step_lfsr(state_t*), whereas the C compiler will, and this makes all the difference for this toy problem. I can demonstrate this by forbidding inlining with the __attribute__ ((noinline)) attribute. Overall, the C executable gets slower, and hence the gap between Haskell and C closes.
Here are the results:
=== RunRepeat =======
#iter: 100000000
Baseline: 0.334414
IO: 0.325433
factor: 0.9731440669349967
=== RunRepeatAlloca =======
#iter: 100000000
Baseline: 0.330629
IO: 0.333735
factor: 1.0093942152684732
=== RunRepeatLoop =====
#iter: 100000000
Baseline: 0.33195399999999997
IO: 0.33791
factor: 1.0179422450098508
I.e. there is no longer any penalty for FFI calls to step_lfsr.
=== RunAvg =========
#iter: 10000000
Baseline: 3.4072e-2
IO: 1.3602589999999999
factor: 39.92307466541442
=== RunAvgStreaming ===
#iter: 50000000
Baseline: 0.191264
IO: 0.666438
factor: 3.484388070938598
Good old lists don't fuse, hence the huge performance hit, and the streaming library also isn't optimal. But Data.Vector.Fusion.Stream.Monadic gets within 20% of the C performance:
=== RunVector =========
#iter: 200000000
Baseline: 0.705265
IO: 0.843916
factor: 1.196594188000255
It has already been observed that GHC doesn't inline FFI calls: "How to force GHC to inline FFI calls?".
For situations like this one, where the benefit of inlining is so high, i.e. the workload per FFI call is so low, it might be worth looking into inline-c.
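For the record, here is the rough shape of inline-c's API: the C expression lives right next to the Haskell code that calls it. This toy example is unrelated to the LFSR; binding the LFSR's state_t would additionally require an inline-c context, which is beyond this sketch.
{-# LANGUAGE QuasiQuotes     #-}
{-# LANGUAGE TemplateHaskell #-}

import qualified Language.C.Inline as C
import Foreign.C.Types (CInt)

-- a toy example of inline-c (unrelated to the LFSR): the C expression
-- is written inline at the Haskell call site
square :: CInt -> IO CInt
square x = [C.exp| int { $(int x) * $(int x) } |]

main :: IO ()
main = square 7 >>= print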