I was under the impression that Haskell would run a program like the one below in parallel (each (a,b,c) combination would get pushed through all the filters independently).
main = print $
       map (\(a,b,c) -> a * b * c) $
       filter (\(a,b,c) -> a^2 + b^2 == c^2) $
       filter (\(a,b,c) -> a + b + c == 1000) $
       filter (\(a,b,c) -> a < b && b < c) $
       [(a,b,c) | a <- [0..1000], b <- [0..1000], c <- [0..1000]]
But when I run the program, I can see that only one of the four threads on my machine is utilised.
Why is my expectation wrong?
Should this run in parallel?
No, because GHC doesn't add parallelism by default (see below). Also, parallelism isn't a convenient orbital cannon that just blasts away any kind of problem (see further below).
Why is my expectation wrong?
First of all, using runhaskell is mostly the same as using GHCi: it doesn't use optimization, since -O conflicts with --interactive; it doesn't give you additional RTS options; and you cannot use all those nice compiler flags that give you a little bit more juice.
But even if you compile your code with a threaded runtime, you won't end up with a faster executable:
$ ghc --make -O2 -rtsopts -with-rtsopts=-N -threaded SO.hs
$ .\SO +RTS -s
[31875000]
   2,863,269,440 bytes allocated in the heap
       1,135,584 bytes copied during GC
         100,016 bytes maximum residency (2 sample(s))
          31,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5471 colls,  5471 par    0.266s   0.283s     0.0001s    0.0126s
  Gen  1         2 colls,     1 par    0.000s   0.001s     0.0004s    0.0007s

  Parallel GC work balance: 0.00% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   20.328s  ( 21.671s elapsed)  <-------
  GC      time    0.266s  (  0.284s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   20.609s  ( 21.956s elapsed)

  Alloc rate    140,852,608 bytes per MUT second

  Productivity  98.7% of total user, 92.7% of total elapsed
That's because GHC doesn't add parallelism automatically. While it would be nice to just flip a switch and be done, parallelism can cause a pretty high overhead if done wrong. For example, if f :: Int -> T is a complex function, then running head $ filter p $ parallelMap f [1..100] is probably fine. But calling head $ filter p $ parallelMap f [1..] isn't. After all, Haskell is lazy.
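To make the point concrete, here is a minimal sketch of what the hypothetical parallelMap above could look like, written with par and pseq from GHC.Conc (which ships with base); the names parallelMap and expensive are mine, not from the original, and real code would usually reach for parMap from the parallel package instead:

```haskell
import GHC.Conc (par, pseq)

-- A deliberately expensive stand-in for the complex function f above.
expensive :: Int -> Int
expensive n = sum [i * i | i <- [1 .. n * 1000]]

-- Spark the evaluation of each list element; with a threaded runtime
-- (compiled with -threaded, run with +RTS -N) GHC may evaluate the
-- sparks on other cores. On a finite list this is fine; on an infinite
-- list it would eagerly spark work that head/filter might never need.
parallelMap :: (a -> b) -> [a] -> [b]
parallelMap _ []     = []
parallelMap f (x:xs) = let y  = f x
                           ys = parallelMap f xs
                       in y `par` (ys `pseq` (y : ys))

main :: IO ()
main = print $ sum (parallelMap expensive [1 .. 100])
```

Without -threaded and +RTS -N, the sparks are simply evaluated on the single capability, which is exactly the behaviour the question observed.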
Now that you know why there's no automagical parallelism in Haskell, what can you do to speed up your program? First of all, structure it:
triples :: [(Int, Int, Int)]
triples = filter pythagoras . filter smaller . filter sum1000 $ ts
  where
    pythagoras (a,b,c) = a ^ 2 + b ^ 2 == c ^ 2
    sum1000    (a,b,c) = a + b + c == 1000
    smaller    (a,b,c) = a < b && b < c
    ts = [(a,b,c) | a <- [0..1000], b <- [0..1000], c <- [0..1000]]

main :: IO ()
main = print $ map (\(a,b,c) -> a * b * c) $ triples
Now, this is a lot easier to read than your previous program. Hm. You generate a list and then apply three filters afterwards. Wait a second. sum1000 and smaller seem off. For any given range, the number of triples that fulfil smaller is usually relatively small, and for any given a and b, there is only one c that fulfils sum1000!
We can fuse both conditions together to get the following constraints on a, b and c directly:

- a can never be larger than 332, since we cannot split 1000 - 333 into b and c so that smaller still holds (667 = 333 + 334)
- b is always greater than a
- b can never be larger than (1000 - a) / 2, otherwise there's no suitable c
- c is always 1000 - a - b, but there is no c for a = 0 and b = 500

We end up with the following list instead:
triples :: [(Int, Int, Int)]
triples = filter pythagoras . filter smaller . filter sum1000 $ ts
  where
    pythagoras (a,b,c) = a ^ 2 + b ^ 2 == c ^ 2
    sum1000    (a,b,c) = a + b + c == 1000
    smaller    (a,b,c) = a < b && b < c
    ts = [(a,b,c) | a <- [0..332]
                  , b <- [a+1 .. (1000 - a) `div` 2]
                  , let c = 1000 - a - b]
    -- Old list for documentation:
    -- ts = [(a,b,c) | a <- [0..1000], b <- [0..1000], c <- [0..1000]]
You could also remove the filters, but then don't forget the check for b < c.
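For illustration, such a filter-free variant might look like the sketch below; triplesFused is a name I'm introducing, and the b < c guard stays in the comprehension because b can equal c at the upper end of its range:

```haskell
-- All constraints moved into the comprehension itself; no filters left.
triplesFused :: [(Int, Int, Int)]
triplesFused =
  [ (a, b, c)
  | a <- [0 .. 332]
  , b <- [a + 1 .. (1000 - a) `div` 2]
  , let c = 1000 - a - b
  , b < c                       -- the check that is easy to forget
  , a ^ 2 + b ^ 2 == c ^ 2
  ]

main :: IO ()
main = print [a * b * c | (a, b, c) <- triplesFused]  -- prints [31875000]
```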
This is a lot faster, since we're now using an O(n²) approach instead of an O(n³) one. runhaskell SO.hs finishes after about 2 seconds on my PC, and if we actually compile it, we end up with an executable that finishes almost immediately:
$ ghc --make -O2 SO.hs
$ ./SO +RTS -s
[31875000]
         104,960 bytes allocated in the heap
           1,752 bytes copied during GC
          42,664 bytes maximum residency (1 sample(s))
          18,776 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0005s    0.0005s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.000s  (  0.002s elapsed)  <----------------
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.000s  (  0.003s elapsed)

  %GC     time       0.0%  (13.9% elapsed)

  Alloc rate    0 bytes per MUT second

  Productivity 100.0% of total user, 0.0% of total elapsed
Reducing the work to a tiny fraction of its original size always beats running far too much work in parallel.