I have an AWS instance. I would like to run a bunch of tasks, some memory- and CPU-intensive. Ideally, I would like to compute timing information for each task. If I run them serially, the timing information is accurate, but the whole run is slow. If I run them in parallel, the whole run is faster, but individual tasks are slower, as reported by both wall time and thread CPU time. This slowdown increases as the number of threads increases, up to the number of CPUs.
Cursory examination with ghc-events-analyze and +RTS -s suggests that the source of the slowdown is (unsurprisingly) GC pauses. Playing with RTS options reveals that +RTS -qg -qb -qa -A256m (disabling parallel GC, disabling load balancing in the parallel GC, pinning OS threads to cores, and increasing the allocation area to 256 MB) improves things, but does not completely eliminate the slowdown.
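For reference, the whole set of flags might be applied like this (assuming an executable named tasks, built with -threaded and -rtsopts; the names are illustrative):

    ghc -O2 -threaded -rtsopts tasks.hs
    ./tasks +RTS -N -qg -qb -qa -A256m -s -RTS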
I am running the threads using forkIO, but the threads are independent and pure apart from printing progress information. I'm using parallel-io to manage the number of running threads (a sketch of that setup is below), but when I briefly tried a more conventional approach with a fixed pool of worker threads and a task queue, I still saw the same problem.
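For concreteness, the parallel-io driver looks roughly like the following (a minimal sketch; paramsList and computation stand in for the real task list and worker):

import Control.Concurrent.ParallelIO.Global (parallel, stopGlobalPool)

main :: IO ()
main = do
    -- The global pool is sized to the number of capabilities (+RTS -N).
    results <- parallel (map computation paramsList)
    stopGlobalPool  -- required exactly once, before the program exits
    mapM_ print results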
Any suggestions for how to debug?
EDIT: @jberryman asked for an example. Each of the tasks looks like the code below.
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

computation params = do
    !x <- evaluate (force params)  -- fully evaluate the input before timing
    putStrLn $ "Starting computation on " ++ show params
    t1 <- getCPUTime
    !y <- fmap force $ do  -- the bang pattern forces the deep evaluation here
        ...some work with x ...
    t2 <- getCPUTime
    putStrLn $ "Finished computation on " ++ show params
    return (t2 - t1, y)
Since the tasks are all independent, and you're on an AWS instance (which is probably running Linux), you'll probably get better results using forkProcess. That way, each process gets its own heap and GC, which is freed when the process exits, and the parent only has to hold on to the child process IDs and wait for the children to exit.
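A minimal sketch of that approach, reusing paramsList and computation from the question (runAll is a hypothetical driver; since the processes don't share memory, each child just prints its timing here, and sending results back to the parent would need a pipe or a file):

import Control.Monad (forM, forM_)
import System.Posix.Process (forkProcess, getProcessStatus)

runAll :: IO ()
runAll = do
    pids <- forM paramsList $ \params ->
        forkProcess $ do
            -- each child has its own heap, so its GC cannot pause its siblings
            (dt, _y) <- computation params
            putStrLn $ "CPU time for " ++ show params ++ ": " ++ show dt
    -- block until every child has exited
    forM_ pids $ \pid -> getProcessStatus True False pid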