I'm writing a program where a large number of agents listen for events and react to them. Since Control.Concurrent.Chan.dupChan
is deprecated, I decided to use TChans as advertised.
The performance of TChan is much worse than I expected. The following program illustrates the issue:
{-# LANGUAGE BangPatterns #-}

module Main where

import Control.Concurrent.STM
import Control.Concurrent
import System.Random (randomRIO)
import Control.Monad (forever, when)

allCoords :: [(Int,Int)]
allCoords = [(x,y) | x <- [0..99], y <- [0..99]]

randomCoords :: IO (Int,Int)
randomCoords = do
  x <- randomRIO (0,99)
  y <- randomRIO (0,99)
  return (x,y)

main :: IO ()
main = do
  chan <- newTChanIO :: IO (TChan ((Int,Int),Int))
  -- one watcher thread per coordinate, each with its own copy of the channel
  let watcher p = do
        chan' <- atomically $ dupTChan chan
        forkIO $ forever $ do
          r@(p', _counter) <- atomically $ readTChan chan'
          when (p == p') (print r)
        return ()
  mapM_ watcher allCoords
  -- writer loop: broadcast random events as fast as possible
  let go !cnt = do
        xy <- randomCoords
        atomically $ writeTChan chan (xy, cnt)
        go (cnt + 1)
  go 1
When compiled (with -O) and run, the program will at first output something like this:
./tchantest
((0,25),341) ((0,33),523) ((0,33),654) ((0,35),196) ((0,48),181) ((0,48),446) ((1,15),676) ((1,50),260) ((1,78),561) ((2,30),622) ((2,38),383) ((2,41),365) ((2,50),596) ((2,57),194) ((3,19),259) ((3,27),344) ((3,33),65) ((3,37),124) ((3,49),109) ((3,72),91) ((3,87),637) ((3,96),14) ((4,0),34) ((4,17),390) ((4,73),381) ((4,74),217) ((4,78),150) ((5,7),476) ((5,27),207) ((5,47),197) ((5,49),543) ((5,53),641) ((5,58),175) ((5,70),497) ((5,88),421) ((5,89),617) ((6,0),15) ((6,4),322) ((6,16),661) ((6,18),405) ((6,30),526) ((6,50),183) ((6,61),528) ((7,0),74) ((7,28),479) ((7,66),418) ((7,72),318) ((7,79),101) ((7,84),462) ((7,98),669) ((8,5),126) ((8,64),113) ((8,77),154) ((8,83),265) ((9,4),253) ((9,26),220) ((9,41),255) ((9,63),51) ((9,64),229) ((9,73),621) ((9,76),384) ((9,92),569) ...
And then, at some point, it stops printing anything at all, while still consuming 100% CPU.
((20,56),186) ((20,58),558) ((20,68),277) ((20,76),102) ((21,5),396) ((21,7),84)
With -threaded the lockup happens even faster, after only a handful of lines. It will also consume however many cores are made available through the RTS -N flag.
Additionally, the performance seems rather poor: only about 100 events per second are processed.
Is this a bug in STM, or am I misunderstanding something about the semantics of STM?
The program is going to perform quite badly. You're spawning off 10,000 threads, all of which will queue up waiting for a single TVar to be written to. So once they're all going, you may well get this happening:
1. Each of the 10,000 threads blocks, waiting on the TVar at the write end of the channel.
2. An item is written to the channel, which wakes up all 10,000 threads.
3. Each thread reads the item from its own copy of the channel; the one whose coordinate matches prints it, and then all 10,000 block again waiting for the next write.
So each item causes O(10,000) processing. If you see 100 events per second, that means each thread requires about 1 microsecond to wake up, read a couple of TVars, write to one, and queue up again. That doesn't seem so unreasonable. I don't understand why the program would grind to a complete halt, though.
In general, I would scrap this design and replace it as follows:
Have a single thread reading the event channel, which maintains a map from coordinate to interested-receiver channel. The single thread can then pick out the receiver(s) from the map in O(log N) time (much better than O(N), and with a much smaller constant factor involved), and send the event only to them. So you perform just one or two communications to the interested party, rather than 10,000 communications to everyone. A list-based form of the idea is written in CHP in section 5.4 of this paper: http://chplib.files.wordpress.com/2011/05/chp.pdf
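Here is a minimal sketch of that design, under my own assumptions: one private TChan per coordinate held in a Data.Map, a single dispatcher thread doing the lookup, and the question's writer feeding an events channel (the names Event, events, chans and dispatcher are introduced for illustration, they are not from the question):

{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.Map as Map
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever)
import System.Random (randomRIO)

type Event = ((Int,Int), Int)

allCoords :: [(Int,Int)]
allCoords = [(x,y) | x <- [0..99], y <- [0..99]]

main :: IO ()
main = do
  events <- newTChanIO :: IO (TChan Event)
  -- one private channel per coordinate; only its own watcher ever reads it
  chans  <- Map.fromList <$>
              mapM (\p -> do c <- newTChanIO; return (p, c)) allCoords
  -- each watcher blocks on its own channel, so a write wakes exactly one thread
  mapM_ (\c -> forkIO $ forever $ atomically (readTChan c) >>= print)
        (Map.elems chans)
  -- the single dispatcher: O(log N) lookup, then one targeted write
  _ <- forkIO $ forever $ do
         ev@(p, _) <- atomically $ readTChan events
         case Map.lookup p chans of
           Just c  -> atomically $ writeTChan c ev
           Nothing -> return ()
  -- the writer from the question, unchanged apart from which channel it writes to
  let go !cnt = do
        x <- randomRIO (0,99)
        y <- randomRIO (0,99)
        atomically $ writeTChan events ((x,y), cnt)
        go (cnt + 1)
  go 1

The key difference from the original is that a write to the events channel is seen by exactly one reader (the dispatcher) rather than being broadcast to 10,000 dupTChan copies.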
This is a great test case! I think you've actually created a rare instance of genuine livelock/starvation. We can test this by compiling with -eventlog and running with -vst, or by compiling with -debug and running with -Ds. We see that even as the program "hangs", the runtime is still working like crazy, jumping between blocked threads.
The high-level reason is that you have one (fast) writer and many (fast) readers. The readers and the writer both need to access the same TVar representing the end of the queue. Let's say that, nondeterministically, one thread succeeds and all the others fail when this happens. Now, as we increase the number of threads in contention to 100*100, the probability of any given reader making progress rapidly goes towards zero. In the meantime, the writer in fact takes longer in its access to that TVar than the readers do, which makes things worse for it.
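To see why all of those threads pile up on one TVar, it helps to look at roughly how a broadcast TChan is represented. This is a simplified sketch of the idea, not the exact code from the stm package:

import Control.Concurrent.STM (TVar)

-- A channel is a linked list whose cells are TVars.
type TVarList a = TVar (TList a)
data TList a    = TNil | TCons a (TVarList a)

-- A TChan is a read pointer and a write pointer into that list.
-- dupTChan gives each reader its own read pointer, but every copy shares
-- the same write end: the empty "hole" TVar at the tail of the list.
data TChan a = TChan (TVar (TVarList a))   -- read end (one per dup)
                     (TVar (TVarList a))   -- write end (shared)

Once a reader has consumed everything, its read pointer points at that shared hole and readTChan retries on it, so with 10,000 idle readers every writeTChan commits against a TVar that 10,000 blocked transactions are waiting on.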
In this instance, putting a tiny throttle between each invocation of go for the writer (say, threadDelay 100) is enough to fix the problem. It gives the readers enough time to all block between successive writes, and so eliminates the livelock. However, I do think that it would be an interesting problem to improve the behavior of the runtime scheduler to deal with situations like this.
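Concretely, that means changing only the writer loop in the question's main (threadDelay comes from Control.Concurrent, which the program already imports, and takes microseconds; 100 is just the value suggested above, not a carefully tuned number):

  let go !cnt = do
        xy <- randomCoords
        atomically $ writeTChan chan (xy, cnt)
        threadDelay 100   -- give the readers time to block again before the next write
        go (cnt + 1)
  go 1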