I was benchmarking `forkIO` with the following code:
```haskell
import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.IORef

n = 200000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newIORef (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIORef' count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d
```
This spawns 200K threads, counts how many have run using an `IORef`, and finishes once they have all run. When run on GHC 8.10.1 on Windows with the command `ghc --make -O2 Main -threaded && main +RTS -N4`, the performance varies remarkably: sometimes it takes more than 1s (e.g. 1.19s) and sometimes less than 0.1s (e.g. 0.08s). It lands in the faster bucket about 1/6th of the time. Why the difference in performance? What causes it to go faster?
When I scale `n` up to 1M the effect goes away and it's always in the 5+s range.
I can confirm the same behavior on Ubuntu as well, except that when I set `n` to 1M the behavior does not go away: the runtime ranges for me from 2 to 7 seconds.
I believe the non-determinism of the scheduler is the cause of such significant variance in runtime. This is not a definitive answer, of course, since it is merely my guess.
`atomicModifyIORef'` is implemented with CAS (compare-and-swap), so depending on how the threads are scheduled, the function `old + 1` will be recomputed more or fewer times. In other words, if thread B updates the `count` ref after thread A has started its own update but before A commits it, A's CAS fails and it has to start the update operation over from the beginning, reading the newly updated value from the ref and recomputing `old + 1` once again.
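One consequence of this retry loop is that every increment still commits exactly once, no matter how often a thread has to restart. Here is a minimal sketch that hammers a single `IORef` from many threads (the thread and iteration counts are arbitrary choices for illustration) and checks that no update is ever lost:

```haskell
import Control.Concurrent
import Control.Monad
import Data.IORef

-- Hammer one IORef from many threads. Each atomicModifyIORef' either
-- commits its increment or fails its CAS and retries, so the final
-- count is always threads * iters.
main :: IO ()
main = do
    ref  <- newIORef (0 :: Int)
    done <- newEmptyMVar
    let threads = 100
        iters   = 1000
    replicateM_ threads $ forkIO $ do
        replicateM_ iters $ atomicModifyIORef' ref $ \old -> (old + 1, ())
        putMVar done ()
    replicateM_ threads $ takeMVar done  -- wait for all workers
    readIORef ref >>= print              -- always 100000
```

The variance discussed above is about how much *time* those retries burn, not about correctness.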
If you run `main +RTS -N1`, you will see that not only does the program take a lot less time to run, but the runtime is also pretty consistent between executions. I suspect this is because only one thread can run at any time and there is no preemption until `atomicModifyIORef'` is done.
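For what it's worth, you can confirm from inside the program how many capabilities the RTS was actually given. A small sketch using `getNumCapabilities` and `threadCapability` from `Control.Concurrent` (run without any `+RTS -N` flags it reports a single capability):

```haskell
import Control.Concurrent

-- Report the number of RTS capabilities (set by +RTS -N) and check
-- that the current thread is running on one of them.
main :: IO ()
main = do
    caps <- getNumCapabilities
    (cap, _locked) <- threadCapability =<< myThreadId
    putStrLn $ "capabilities: " ++ show caps
    print (cap < caps)  -- the running capability index is always in range
```

This makes it easy to verify that a `-N1` run really is restricted to one capability before comparing timings.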
Hopefully someone else with a deep understanding of the Haskell RTS can provide more insight into this behavior, but that is my take on it.
Edit
@NeilMitchel commented:
I'm not convinced it's anything to do with the atomic modification at all
In order to prove that `IORef` is indeed at fault here, here is an implementation that uses `PVar`, which relies on `casIntArray#` underneath. Not only is it 10 times faster, but there is also no variance observed:
```haskell
import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.Primitive.PVar -- from the `pvar` package

n = 1000000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newPVar (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIntPVar count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d
```