MVar, TVar, IORef, ... I can't speedup a thunk problem (I think).
(My original problem is a threaded code, I do "forkIO" n-times calling "addMany"; but I think my problem is on "shW" function)
Let next code:
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
import Control.Monad
import System.Environment(getArgs)
import Data.Int
import Data.IORef
-- "i" times, add "n" for each IORef (in "a")
addMany :: [IORef Int64] -> Int64 -> Int64 -> IO ()
addMany !a !n !i =
forM_ [1..i] (\_ ->
forM_ a (shW n))
-- MVar, TVar, IORef, ... read/write (x' = x + k)
shR = readIORef
shW !k !r = atomicModifyIORef r (\ !x' -> (x' + k, ()))
main = do
(n':i':_) <- getArgs
let (n, i) = (read n', read i')
v <- forM [1..n] (\_ -> newIORef 0)
addMany v 1 i
mapM shR v >>= (putStrLn.show.sum)
then, profile result show:
MUT time 3.12s ( 3.12s elapsed)
GC time 6.96s ( 6.96s elapsed)
...
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 47 0 0.0 0.0 100.0 100.0
main Main 95 0 0.0 0.0 100.0 100.0
main.i Main 100 1 0.0 0.0 0.0 0.0
addMany Main 99 1 0.4 0.5 100.0 100.0
addMany.\ Main 101 15000 6.6 0.0 99.6 99.5
shW Main 102 2250000 92.7 99.5 93.0 99.5
shW.\ Main 104 2250000 0.3 0.0 0.3 0.0
I can't remove thunk on "shW" calls (and memory usage is huge). What wrong?
A similar C# code run much (much) faster:
class Node {
private object m;
private int n;
public Node() {n = 0; m = new object();}
public void Inc() {lock(m) n++;}
public int Read() {return n;}
}
class MainClass {
public static void Main(string[] args) {
var nitems = 1000;
var nthreads = 6;
var niters = 100000;
var g = Enumerable.Range(0, nitems).Select(i => new Node()).ToArray();
Task.WaitAll(Enumerable.Range(0, nthreads).Select(q => Task.Factory.StartNew(() => {
var z = niters;
while(z-- > 0)
foreach(var i in g)
i.Inc();
})).ToArray());
Console.WriteLine("Sum node values: {0}", g.Sum(i => i.Read()));
}
}
Thanks a lot!
UPDATE
Solved original problem at: https://gist.github.com/3742897
Thank you very much Don Stewart!
Parallel and concurrent programming is much easier in Haskell: it's pure, which means that there are no mutations to observe and all data can be shared between threads freely; it supports modern techniques like STM (among many other options for parallel and concurrent programming);
Haskell supports both pure parallelism and explicit concurrency.
Haskell (with the GHC compiler) is a lot faster than you'd expect. Used correctly, it can get close-ish to low-level languages. (A favorite thing for Haskellers to do is to try and get within 5% of C (or even beat it, but that means you are using an inefficient C program, since GHC compiles Haskell to C).)
Without -threaded , the Haskell process uses a single OS thread only, and multithreaded foreign calls are not supported.
It is immediately obvious there is a leak when you look at the heap and the GC time:
USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
1,208,298,840 bytes allocated in the heap
1,352,861,868 bytes copied during GC
280,181,684 bytes maximum residency (9 sample(s))
4,688,680 bytes maximum slop
545 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1677 colls, 0 par 2.27s 2.22s 0.0013s 0.0063s
Gen 1 9 colls, 0 par 1.66s 1.77s 0.1969s 1.0273s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.70s ( 0.77s elapsed)
GC time 3.92s ( 4.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.62s ( 4.77s elapsed)
%GC time 84.8% (83.8% elapsed)
Alloc rate 1,718,469,461 bytes per MUT second
Productivity 15.2% of total user, 14.7% of total elapsed
real 0m4.752s
user 0m0.015s
sys 0m0.046s
280M residency and 89% GC. A lot of thunks are being allocated and thrown away.
A heap profile makes this obivous.
The clue is that these are "stg_app*" thingies (i.e. STG machine apply thunks).
A subtle, but remarkable, issue with the modify family of functions is the issue here -- when you have a lazy atomicModify, there simply is no way to strictly update the field, without demanding the value.
So all your careful use of atomicModifyIORef r (\ !x' -> (x' + k, ()))
does is build a chain of applications of the (+)
function, such that the result of the chain (i.e. the value in the cell) is observed, each addition will be strict in its argument. Not what you want! None of your strictness annotations on the argument to modifyIORef will have any effect on the cell itself. Now, normally a lazy modify is what you want -- it is just a pointer swap , so you can have very short atomic sections.
But sometimes that's not what you want.
(For the background to this issue see GHC ticket #5926, however, the issue was known at least in 2007, when I wrote the strict-concurrency package to avoid this issue with MVars. It was discussed in 2009, and we now have strict versions in 2012).
By demanding the value first, you can remove the issue. E.g.
shW !k !r = --atomicModifyIORef r (\x -> (x + k, ()))
do x <- shR r
writeIORef r $! (x+1)
Note that this issue is now documented in the libraries, and you can use atomicModifyIORef'
to avoid it.
And we get:
USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
120,738,836 bytes allocated in the heap
3,758,476 bytes copied during GC
73,020 bytes maximum residency (1 sample(s))
16,348 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 230 colls, 0 par 0.00s 0.01s 0.0000s 0.0001s
Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.17s ( 0.17s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.19s ( 0.17s elapsed)
%GC time 0.0% (3.9% elapsed)
Alloc rate 643,940,458 bytes per MUT second
Productivity 100.0% of total user, 108.3% of total elapsed
real 0m0.218s
user 0m0.000s
sys 0m0.015s
That is, a 22x speedup, and memory use becomes constant. For laughs, here's the new heap profile:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With