I was benchmarking `forkIO` with the following code:
```haskell
import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.IORef

n = 200000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newIORef (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIORef' count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d
```
This spawns 200K threads, counts how many have run using an `IORef`, and finishes once they have all run. When run on GHC 8.10.1 on Windows with the command `ghc --make -O2 Main -threaded && main +RTS -N4`, the performance varies remarkably: sometimes it takes more than 1s (e.g. 1.19s) and sometimes less than 0.1s (e.g. 0.08s). It lands in the faster bucket about 1/6th of the time. Why the difference in performance? What causes it to go faster?
When I scale `n` up to 1M the effect goes away and it's always in the 5+s range.
I can confirm the same behavior on Ubuntu as well, except that when I set `n` to 1M the behavior does not go away: the runtime ranges for me from 2 to 7 seconds.
I believe the non-determinism of the scheduler is the cause of such significant variance in runtime. This is not a definitive answer, of course, since it is merely my guess.
`atomicModifyIORef'` is implemented with CAS (compare-and-swap), so depending on how the threads are scheduled, the function `old + 1` will be recomputed more or fewer times. In other words, if thread B updates the `count` ref after thread A has started its own update but before A commits it, A's CAS fails and it has to start the update operation over from the beginning, reading the newly updated value from the ref and recomputing `old + 1` once again.
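One consequence of this retry loop is that every increment still commits exactly once, no matter how often a thread has to restart. Here is a minimal sketch that hammers a single `IORef` from many threads (the thread and iteration counts are arbitrary choices for illustration) and checks that no update is ever lost:

```haskell
import Control.Concurrent
import Control.Monad
import Data.IORef

-- Hammer one IORef from many threads. Each atomicModifyIORef' either
-- commits its increment or fails its CAS and retries, so the final
-- count is always threads * iters.
main :: IO ()
main = do
    ref  <- newIORef (0 :: Int)
    done <- newEmptyMVar
    let threads = 100
        iters   = 1000
    replicateM_ threads $ forkIO $ do
        replicateM_ iters $ atomicModifyIORef' ref $ \old -> (old + 1, ())
        putMVar done ()
    replicateM_ threads $ takeMVar done  -- wait for all workers
    readIORef ref >>= print              -- always 100000
```

The variance discussed above is about how much *time* those retries burn, not about correctness.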
If you run `main +RTS -N1`, you will see that not only does the program take a lot less time to run, but the runtime is also pretty consistent between executions. I suspect this is because only one thread can run at any time and there is no preemption until `atomicModifyIORef'` is done.
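For what it's worth, you can confirm from inside the program how many capabilities the RTS was actually given. A small sketch using `getNumCapabilities` and `threadCapability` from `Control.Concurrent` (run without any `+RTS -N` flags it reports a single capability):

```haskell
import Control.Concurrent

-- Report the number of RTS capabilities (set by +RTS -N) and check
-- that the current thread is running on one of them.
main :: IO ()
main = do
    caps <- getNumCapabilities
    (cap, _locked) <- threadCapability =<< myThreadId
    putStrLn $ "capabilities: " ++ show caps
    print (cap < caps)  -- the running capability index is always in range
```

This makes it easy to verify that a `-N1` run really is restricted to one capability before comparing timings.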
Hopefully someone else with a deep understanding of the Haskell RTS can provide more insight into this behavior, but that is my take on it.
Edit
@NeilMitchel commented:
I'm not convinced it's anything to do with the atomic modification at all
In order to prove that `IORef` is indeed at fault here, here is an implementation that uses `PVar`, which relies on `casIntArray#` underneath. Not only is it 10 times faster, but there is also no variance observed:
```haskell
import System.Time.Extra
import Control.Concurrent
import Control.Monad
import Data.Primitive.PVar -- from the `pvar` package

n = 1000000

main :: IO ()
main = do
    bar <- newEmptyMVar
    count <- newPVar (0 :: Int)
    (d, _) <- duration $ do
        replicateM_ n $ do
            forkIO $ do
                v <- atomicModifyIntPVar count $ \old -> (old + 1, old + 1)
                when (v == n) $ putMVar bar ()
        takeMVar bar
    putStrLn $ showDuration d
```