Should this run in parallel?

Tags:

haskell

ghc

I was under the impression that Haskell would run a program like the one below in parallel (each (a,b,c) combination would get pushed through all the filters independently).

main = print $ 
       map (\(a,b,c) -> a * b * c) $
       filter (\(a,b,c) -> a^2 + b^2 == c^2) $
       filter (\(a,b,c) -> a + b + c == 1000) $
       filter (\(a,b,c) -> a < b && b < c) $
       [(a,b,c) | a <- [0..1000], b <- [0..1000], c <- [0..1000]]

But when I run the program I can see that only one of the four threads on my machine is utilised.

Why is my expectation wrong?

asked Dec 06 '22 by zoran119

1 Answer

Should this run in parallel?

No, because GHC doesn't add parallelism by default (see below). Also, parallelism isn't a convenient orbital cannon that just blasts away any kind of problem (see further below).

Why is my expectation wrong?

First of all, using runhaskell is mostly the same as using GHCi: it doesn't use optimization since -O conflicts with --interactive, it doesn't give you additional RTS options, and you cannot use all those nice compiler flags that give you a little bit more juice.

But even if you compile your code with a threaded runtime, you won't end up with a faster executable:

$ ghc --make -O2 -rtsopts -with-rtsopts=-N -threaded SO.hs
$ ./SO +RTS -s
[31875000]
   2,863,269,440 bytes allocated in the heap
       1,135,584 bytes copied during GC
         100,016 bytes maximum residency (2 sample(s))
          31,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5471 colls,  5471 par    0.266s   0.283s     0.0001s    0.0126s
  Gen  1         2 colls,     1 par    0.000s   0.001s     0.0004s    0.0007s

  Parallel GC work balance: 0.00% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   20.328s  ( 21.671s elapsed)     <-------
  GC      time    0.266s  (  0.284s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   20.609s  ( 21.956s elapsed)

  Alloc rate    140,852,608 bytes per MUT second

  Productivity  98.7% of total user, 92.7% of total elapsed

That's because GHC doesn't add parallelism automatically. While it would be nice to just flip a switch and be done, parallelism can cause a pretty high overhead if done wrong. For example, if f :: Int -> T is a complex function, then running head $ filter p $ parallelMap f [1..100] is probably fine. But calling head $ filter p $ parallelMap f [1..] isn't anymore. After all, Haskell is lazy.
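To see what explicit parallelism looks like, here is a minimal sketch, not part of the original program, using par and pseq from GHC.Conc in base (the names countTriples and totalTriples are illustrative). It splits the search space over a into two halves and sparks one of them:

```haskell
import GHC.Conc (par, pseq)

-- Count Pythagorean triples summing to 1000 with a drawn from [lo .. hi].
-- (Illustrative helper, not from the original question.)
countTriples :: Int -> Int -> Int
countTriples lo hi = length
  [ () | a <- [lo .. hi]
       , b <- [a + 1 .. (1000 - a) `div` 2]
       , let c = 1000 - a - b
       , b < c                        -- rules out degenerate b == c cases
       , a ^ 2 + b ^ 2 == c ^ 2 ]

-- `par` sparks the left half while `pseq` forces the right half first.
-- GHC only runs the spark on another capability when the program is
-- compiled with -threaded and started with +RTS -N; otherwise this
-- degrades gracefully to sequential evaluation.
totalTriples :: Int
totalTriples =
  let left  = countTriples 0   166
      right = countTriples 167 332
  in  left `par` (right `pseq` (left + right))
```

The point is that the programmer, not the compiler, decides where the sparks go: compile with ghc -O2 -threaded -rtsopts and evaluate totalTriples (e.g. via main = print totalTriples) to see the SPARKS line in +RTS -s change.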

Speeding things up without parallelism

Now that you know why there's no automagical parallelism in Haskell, what can you do to speed up your program? First of all, structure it:

triples :: [(Int, Int, Int)]
triples = filter pythagoras . filter smaller . filter sum1000 $ ts
  where 
    pythagoras (a,b,c) = a ^ 2 + b ^ 2 == c ^ 2
    sum1000    (a,b,c) = a + b + c == 1000
    smaller    (a,b,c) = a < b && b < c

    ts = [(a,b,c) | a <- [0..1000], b <- [0..1000], c <- [0..1000]]

main :: IO ()
main = print $ map (\(a,b,c) -> a * b * c) $ triples

Now, this is a lot easier to read than your previous program. Hm. You generate a list and then apply three filters afterwards. Wait a second. sum1000 and smaller seem off. For any given range, the number of triples that fulfil smaller is usually relatively small, and for any given a and b, there is only one c that fulfils sum1000!

We can fuse both conditions together to get the following ones on a, b and c directly:

  • a can never be larger than 332: if a were 333, then smaller forces b ≥ 334 and c ≥ 335, so a + b + c ≥ 1002 > 1000
  • b is always greater than a
  • b can never be larger than (1000 - a) / 2, otherwise there's no suitable c
  • c is always 1000 - a - b; note that a = 0 and b = 500 would give c = 500, which smaller rules out since b < c fails.

We end up with the following list instead:

triples :: [(Int, Int, Int)]
triples = filter pythagoras . filter smaller . filter sum1000 $ ts
  where
    pythagoras (a,b,c) = a ^ 2 + b ^ 2 == c ^ 2
    sum1000    (a,b,c) = a + b + c == 1000
    smaller    (a,b,c) = a < b && b < c

    ts = [(a,b,c) | a <- [0..332]
                  , b <- [a+1 .. (1000 - a)`div` 2]
                  , let c = 1000 - a - b]

-- Old list for documentation
--  ts = [(a,b,c) | a <- [0..1000], b <- [0..1000], c <- [0..1000]]

You could also remove the filters, but then don't forget the check for b < c.
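For reference, one possible fully fused version (a sketch, not the answer's final code) moves every condition into the comprehension, so no separate filter passes remain:

```haskell
-- All three conditions fused into the list comprehension.
-- a < b is guaranteed by b's range, so only b < c needs an explicit guard.
triples :: [(Int, Int, Int)]
triples =
  [ (a, b, c)
  | a <- [0 .. 332]
  , b <- [a + 1 .. (1000 - a) `div` 2]
  , let c = 1000 - a - b           -- sum1000 holds by construction
  , b < c                          -- the check you must not forget
  , a ^ 2 + b ^ 2 == c ^ 2
  ]

products :: [Int]
products = [a * b * c | (a, b, c) <- triples]
```

The only surviving triple is (200, 375, 425), so products matches the original program's output of [31875000].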

This is a lot faster, since we're now using an O(n²) approach instead of an O(n³) one. runhaskell SO.hs will finish after 2 seconds on my PC, and if we actually compile it, we end up with an almost immediately finishing executable:

$ ghc --make -O2 SO.hs
$ ./SO +RTS -s
[31875000]
         104,960 bytes allocated in the heap
           1,752 bytes copied during GC
          42,664 bytes maximum residency (1 sample(s))
          18,776 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0005s    0.0005s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.000s  (  0.002s elapsed)  <----------------
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.000s  (  0.003s elapsed)

  %GC     time       0.0%  (13.9% elapsed)

  Alloc rate    0 bytes per MUT second

  Productivity 100.0% of total user, 0.0% of total elapsed

TL;DR

Decreasing the work to a minuscule fraction of its original size always trumps running far too much work in parallel.

answered Dec 09 '22 by Zeta