Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are possible Haskell optimizations keys?

I found benchmark that solves really simple task in different languages https://github.com/starius/lang-bench . Here 's the code for Haskell :

cmpsum i j k =
    if i + j == k then 1 else 0

main = print (sum([cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]]))

This code runs very slow as you can see in benchmark and I found this very strange. I tried to inline the function cmpsum and compile with the next flags:

ghc -c -O2 main.hs

but it really didn't help. I am not asking about optimizing the algorithm cause it's the same for all languages, but about possible compiler or code optimizations that can make this code run faster.

like image 567
ig-melnyk Avatar asked Jan 27 '15 18:01

ig-melnyk


2 Answers

Not a complete answer, sorry. Compiling with GHC 7.10 on my machine I get ~12s for your version.

I'd suggest always compiling with -Wall which shows us that our numbers are being defaulted to the infinite precision Integer type. Fixing that:

module Main where

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k =
    if i + j == k then 1 else 0

main :: IO ()
main = print (sum([cmpsum i j k |
    i <- [1..1000], j <- [1..1000], k <- [1..1000]]))

This runs in ~5s for me. Running with +RTS -s seems to show we have a loop in constant memory:

          87,180 bytes allocated in the heap
           1,704 bytes copied during GC
          42,580 bytes maximum residency (1 sample(s))
          18,860 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    4.920s  (  4.919s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    4.920s  (  4.921s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    17,719 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

-fllvm shaves off another second or so. Maybe someone else can look into it further.

Edit: Just digging into this a little further. It doesn't look like fusion is happening. Even if I change sum to a foldr (+) 0 which is an explicit "good producer/good consumer" pair.

Rec {
$wgo [InlPrag=[0], Occ=LoopBreaker] :: Int# -> Int#
[GblId, Arity=1, Str=DmdType <S,U>]
$wgo =
  \ (w :: Int#) ->
    let {
      $j :: Int# -> Int#
      [LclId, Arity=1, Str=DmdType]
      $j =
        \ (ww [OS=OneShot] :: Int#) ->
          letrec {
            $wgo1 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
            [LclId, Arity=1, Str=DmdType <S,1*U>]
            $wgo1 =
              \ (w1 :: [Int]) ->
                case w1 of _ [Occ=Dead] {
                  [] -> ww;
                  : y ys ->
                    case $wgo1 ys of ww1 { __DEFAULT ->
                    case lvl of _ [Occ=Dead] {
                      [] -> ww1;
                      : y1 ys1 ->
                        case y of _ [Occ=Dead] { I# y2 ->
                        case y1 of _ [Occ=Dead] { I# y3 ->
                        case tagToEnum# @ Bool (==# (+# w y2) y3) of _ [Occ=Dead] {
                          False ->
                            letrec {
                              $wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
                              [LclId, Arity=1, Str=DmdType <S,1*U>]
                              $wgo2 =
                                \ (w2 :: [Int]) ->
                                  case w2 of _ [Occ=Dead] {
                                    [] -> ww1;
                                    : y4 ys2 ->
                                      case y4 of _ [Occ=Dead] { I# y5 ->
                                      case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
                                        False -> $wgo2 ys2;
                                        True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
                                      }
                                      }
                                  }; } in
                            $wgo2 ys1;
                          True ->
                            letrec {
                              $wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
                              [LclId, Arity=1, Str=DmdType <S,1*U>]
                              $wgo2 =
                                \ (w2 :: [Int]) ->
                                  case w2 of _ [Occ=Dead] {
                                    [] -> ww1;
                                    : y4 ys2 ->
                                      case y4 of _ [Occ=Dead] { I# y5 ->
                                      case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
                                        False -> $wgo2 ys2;
                                        True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
                                      }
                                      }
                                  }; } in
                            case $wgo2 ys1 of ww2 { __DEFAULT -> +# 1 ww2 }
                        }
                        }
                        }
                    }
                    }
                }; } in
          $wgo1 lvl } in
    case w of wild {
      __DEFAULT -> case $wgo (+# wild 1) of ww { __DEFAULT -> $j ww };
      1000 -> $j 0
    }
end Rec }

In fact, looking at the core for print $ foldr (+) (0:: Int) $ [ i+j | i <- [0..10000], j <- [0..10000]] it seems as though only the first layer of the list comprehension is fused. Is that a bug?

like image 60
jberryman Avatar answered Oct 11 '22 16:10

jberryman


This code gets the job done in 1 second and no extra allocation in GHC 7.10 with -O2 (see the bottom for profiling output):

cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = fromEnum (i+j==k)

main = print $ sum [cmpsum i j k | i <- [1..1000],
                                   j <- [1..const 1000 i],
                                   k <- [1..const 1000 j]]

In GHC 7.8, you can get almost the same results in this case (1.4 seconds) if you add the following at the beginning:

import Prelude hiding (sum)

sum xs = foldr (\x r a -> a `seq` r (a+x)) id xs 0

There are three issues here:

  1. Specializing the code to Int instead of letting it default to Integer is crucial.

  2. GHC 7.10 offers list fusion for sum that GHC 7.8 does not. This is because the new definition of sum, based on a new definition of foldl, can be very bad in some cases without the "call arity" analysis Joachim Breitner created for GHC 7.10.

  3. GHC performs a limited "full laziness" pass very early in compilation, before any inlining occurs. As a result, the constant [1..1000] terms for j and k, which are used multiple times in the loop, get hoisted out of the loop. This would be good if these were actually expensive to calculate, but in this context it's much cheaper to do the additions over and over and over instead of saving the results. What the code above does is trick GHC. Since const isn't inlined until a little bit later, this first full laziness pass doesn't see that the lists are constant, so it doesn't hoist them out. I wrote it this way because it's nice and short, but it is, admittedly, a little on the fragile side. To make it more robust, use phased inlining:

    main = print $ sum [cmpsum i j k | i <- [1..1000],
                                       j <- [1..konst 1000 i],
                                       k <- [1..konst 1000 j]]
    
    {-# INLINE [1] konst #-}
    konst = const
    

    This guarantees that konst will be inlined in simplifier phase 1, but no earlier. Phase 1 occurs after list fusion is complete, so it's perfectly safe to let GHC see everything then.

          51,472 bytes allocated in the heap
           3,408 bytes copied during GC
          44,312 bytes maximum residency (1 sample(s))
          17,128 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.071s  (  1.076s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.073s  (  1.077s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    48,059 bytes per MUT second

  Productivity  99.9% of total user, 99.6% of total elapsed
like image 35
dfeuer Avatar answered Oct 11 '22 17:10

dfeuer