
Speed up Haskell concurrency

Whether I use MVar, TVar, IORef, ... I can't speed this up; I think it's a thunk problem.

(My original problem is threaded code: I call "forkIO" n times, each thread running "addMany"; but I think the problem is in the "shW" function. A sketch of that threaded harness follows the code below.)

Consider the following code:

{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
import Control.Monad
import System.Environment(getArgs)
import Data.Int
import Data.IORef

-- "i" times, add "n" for each IORef (in "a")
addMany :: [IORef Int64] -> Int64 -> Int64 -> IO ()
addMany !a !n !i =
  forM_ [1..i] (\_ ->
    forM_ a (shW n))

-- MVar, TVar, IORef, ... read/write (x' = x + k)
shR = readIORef
shW !k !r = atomicModifyIORef r (\ !x' -> (x' + k, ()))

main = do
  (n':i':_) <- getArgs
  let (n, i) = (read n', read i')
  v <- forM [1..n] (\_ -> newIORef 0)
  addMany v 1 i
  mapM shR v >>= (putStrLn.show.sum)
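
The threaded harness looks roughly like this (a sketch only; the worker count t and the MVar-based completion signalling here are illustrative, not part of the code above):

-- Hypothetical sketch: fork t workers, each running addMany over the same
-- IORefs, then block until every worker signals completion via its MVar.
runThreaded :: Int -> [IORef Int64] -> Int64 -> IO ()
runThreaded t v i = do
  dones <- forM [1..t] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO (addMany v 1 i >> putMVar done ())
    return done
  mapM_ takeMVar dones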

Then the profiling output shows:

MUT     time    3.12s  (  3.12s elapsed)
GC      time    6.96s  (  6.96s elapsed)
...

COST CENTRE  MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN         MAIN                     47           0    0.0    0.0   100.0  100.0
 main        Main                     95           0    0.0    0.0   100.0  100.0
  main.i     Main                    100           1    0.0    0.0     0.0    0.0
  addMany    Main                     99           1    0.4    0.5   100.0  100.0
   addMany.\ Main                    101       15000    6.6    0.0    99.6   99.5
    shW      Main                    102     2250000   92.7   99.5    93.0   99.5
     shW.\   Main                    104     2250000    0.3    0.0     0.3    0.0

I can't get rid of the thunks in the "shW" calls (and memory usage is huge). What's wrong?

Similar C# code runs much (much) faster:

using System;
using System.Linq;
using System.Threading.Tasks;

class Node {
    private object m;
    private int n;

    public Node() { n = 0; m = new object(); }
    public void Inc() { lock(m) n++; }
    public int Read() { return n; }
}

class MainClass {

    public static void Main(string[] args) {

        var nitems = 1000;
        var nthreads = 6;
        var niters = 100000;

        var g = Enumerable.Range(0, nitems).Select(i => new Node()).ToArray();
        Task.WaitAll(Enumerable.Range(0, nthreads).Select(q => Task.Factory.StartNew(() => {
            var z = niters;
            while (z-- > 0)
                foreach (var i in g)
                    i.Inc();
        })).ToArray());

        Console.WriteLine("Sum node values: {0}", g.Sum(i => i.Read()));
    }
}

Thanks a lot!

UPDATE

Solved original problem at: https://gist.github.com/3742897

Thank you very much Don Stewart!

asked Sep 18 '12 by josejuan


1 Answer

It is immediately obvious there is a leak when you look at the heap and the GC time:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
   1,208,298,840 bytes allocated in the heap
   1,352,861,868 bytes copied during GC
     280,181,684 bytes maximum residency (9 sample(s))
       4,688,680 bytes maximum slop
             545 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1677 colls,     0 par    2.27s    2.22s     0.0013s    0.0063s
  Gen  1         9 colls,     0 par    1.66s    1.77s     0.1969s    1.0273s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.70s  (  0.77s elapsed)
  GC      time    3.92s  (  4.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.62s  (  4.77s elapsed)

  %GC     time      84.8%  (83.8% elapsed)

  Alloc rate    1,718,469,461 bytes per MUT second

  Productivity  15.2% of total user, 14.7% of total elapsed

  real    0m4.752s
  user    0m0.015s
  sys     0m0.046s

280M residency and roughly 85% of the time spent in GC. A lot of thunks are being allocated and thrown away.

A heap profile makes this obvious:

[heap profile image]
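
(To reproduce a profile like this, something along these lines should work; these are standard GHC profiling flags, with the source file name A.hs assumed:)

$ ghc -O2 -prof -fprof-auto -rtsopts A.hs
$ ./A 1000 10000 +RTS -s -hc -p
$ hp2ps -c A.hp    # render the heap profile A.hp to PostScript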

The clue is that these are "stg_app*" thingies (i.e. STG machine apply thunks).

A subtle but remarkable problem with the modify family of functions is at work here: with a lazy atomicModify, there is simply no way to update the field strictly without demanding the value.

So all your careful use of atomicModifyIORef r (\ !x' -> (x' + k, ())) does is build up a chain of applications of the (+) function; only when the result of the chain (i.e. the value in the cell) is observed will each addition be strict in its argument. Not what you want! None of your strictness annotations on the argument to modifyIORef have any effect on the cell itself. Now, normally a lazy modify is what you want -- it is just a pointer swap, so you can have very short atomic sections.

But sometimes that's not what you want.
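
Here is a minimal, self-contained sketch of the buildup (my own illustration, not from the original post; the constants are arbitrary):

import Data.IORef

main :: IO ()
main = do
  r <- newIORef (0 :: Int)
  -- each call swaps in the unevaluated thunk (old + 1); after the loop the
  -- cell holds a chain (((0 + 1) + 1) + ...) of a million pending additions
  mapM_ (\_ -> atomicModifyIORef r (\x -> (x + 1, ()))) [1 .. 1000000 :: Int]
  -- only this final read forces the whole chain at once
  readIORef r >>= print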

(For background on this issue see GHC ticket #5926; however, the issue was known at least as early as 2007, when I wrote the strict-concurrency package to avoid it with MVars. It was discussed again in 2009, and as of 2012 we have strict versions in the standard libraries.)

By demanding the value first, you can remove the issue. E.g.

-- instead of: atomicModifyIORef r (\x -> (x + k, ()))
shW !k !r =
    do x <- shR r
       writeIORef r $! (x + k)  -- force the new value; note this read/write pair is not atomic

Note that this issue is now documented in the libraries, and you can use atomicModifyIORef' to avoid it.
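
With a recent enough base (GHC 7.6+), the strict version is one line (the type signature here is my addition):

-- atomicModifyIORef' forces the new value to WHNF before installing it,
-- so no thunk chain can build up in the cell, and the update stays atomic.
shW :: Int64 -> IORef Int64 -> IO ()
shW !k !r = atomicModifyIORef' r (\x -> (x + k, ()))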

And we get:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
     120,738,836 bytes allocated in the heap
       3,758,476 bytes copied during GC
          73,020 bytes maximum residency (1 sample(s))
          16,348 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       230 colls,     0 par    0.00s    0.01s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.17s  (  0.17s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.19s  (  0.17s elapsed)

  %GC     time       0.0%  (3.9% elapsed)

  Alloc rate    643,940,458 bytes per MUT second

  Productivity 100.0% of total user, 108.3% of total elapsed


real    0m0.218s
user    0m0.000s
sys     0m0.015s

That is, a 22x speedup, and memory use becomes constant. For laughs, here's the new heap profile:

[heap profile image]

answered by Don Stewart