I have a program I'm trying to parallelize (full paste with runnable code here).
I've profiled and found that the majority of time is spent in findNearest
which is essentially a simple foldr
over a large Data.Map
.
findNearest :: RGB -> M.Map k RGB -> (k, Word32)
findNearest rgb m0 =
M.foldrWithKey' minDistance (k0, distance rgb r0) m0
where (k0, r0) = M.findMin m0
minDistance k r x@(_, d1) =
-- Euclidean distance in RGB-space
let d0 = distance rgb r
in if d0 < d1 then (k, d0) else x
parFindNearest
is supposed to execute findNearest
in parallel over subtrees of the larger Map
.
parFindNearest :: NFData k => RGB -> M.Map k RGB -> (k, Word32)
parFindNearest rgb = minimumBy (comparing snd)
. parMap rdeepseq (findNearest rgb)
. M.splitRoot
Unfortunately GHC GC's most my sparks before they are converted into useful parallelism.
Here's the result of compiling with ghc -O2 -threaded
and running with +RTS -s -N2
839,892,616 bytes allocated in the heap
123,999,464 bytes copied during GC
5,320,184 bytes maximum residency (19 sample(s))
3,214,200 bytes maximum slop
16 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1550 colls, 1550 par 0.23s 0.11s 0.0001s 0.0004s
Gen 1 19 colls, 18 par 0.11s 0.06s 0.0030s 0.0052s
Parallel GC work balance: 16.48% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 215623 (1318 converted, 0 overflowed, 0 dud, 198111 GC'd, 16194 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.72s ( 3.66s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.07s ( 3.84s elapsed)
Alloc rate 225,726,318 bytes per MUT second
Productivity 91.6% of total user, 97.1% of total elapsed
gc_alloc_block_sync: 9862
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2103
As you can see, the majority of sparks are GC'd or fizzle before being converted. I've tried experimenting with different strictness, having findNearest
return a custom strict pair data type instead of a tuple
, or using rdeepseq from Control.Parallel.Strategies
, but my sparks are still GC'd.
I'd like to know
I'm not at expert in parallel strategies, so I may be completely wrong. But:
If you disable GC by setting big enough allocation area (e.g. using -A20M
runtime option), you'll see that most of sparks are fizzled, not GC'd. It means they are evaluated by ordinary program flow before the corresponding spark finished.
minimumBy
forces parMap
results immediately, starting evaluating them. At the same time, sparks are scheduled and executed, but it is too late. When spark finished, the value is already evaluated by the main thread. Without -A20M
, sparks are GC'd because the value is evaluated and GC'd even before the spark is scheduled.
Here is a simplified test case:
import Control.Parallel.Strategies
f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)
main :: IO ()
main = do
let l = [n..n+10]
n = 1
res = parMap rdeepseq f l
print res
In that case all the sparks are fizzled:
SPARKS: 11 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 11 fizzled)
(Some times they are GC'd)
But if I yield main thread before printing results,
import Control.Parallel.Strategies
import Control.Concurrent
f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)
main :: IO ()
main = do
let l = [n..n+10]
n = 1
res = parMap rdeepseq f l
res `seq` threadDelay 1
print res
Then all the sparks are converted:
SPARKS: 11 (11 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
So, looks like you have not enough sparks (try to set l = [n..n+1000]
in my example), and they are not heavy enough (try to set n = 1000
in my example).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With