I found benchmark that solves really simple task in different languages https://github.com/starius/lang-bench . Here 's the code for Haskell :
cmpsum i j k =
if i + j == k then 1 else 0
main = print (sum([cmpsum i j k |
i <- [1..1000], j <- [1..1000], k <- [1..1000]]))
This code runs very slow as you can see in benchmark and I found this very strange. I tried to inline the function cmpsum and compile with the next flags:
ghc -c -O2 main.hs
but it really didn't help. I am not asking about optimizing the algorithm cause it's the same for all languages, but about possible compiler or code optimizations that can make this code run faster.
Not a complete answer, sorry. Compiling with GHC 7.10 on my machine I get ~12s for your version.
I'd suggest always compiling with -Wall
which shows us that our numbers are being defaulted to the infinite precision Integer
type. Fixing that:
module Main where
cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k =
if i + j == k then 1 else 0
main :: IO ()
main = print (sum([cmpsum i j k |
i <- [1..1000], j <- [1..1000], k <- [1..1000]]))
This runs in ~5s for me. Running with +RTS -s
seems to show we have a loop in constant memory:
87,180 bytes allocated in the heap
1,704 bytes copied during GC
42,580 bytes maximum residency (1 sample(s))
18,860 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.001s elapsed)
MUT time 4.920s ( 4.919s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 4.920s ( 4.921s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 17,719 bytes per MUT second
Productivity 100.0% of total user, 100.0% of total elapsed
-fllvm
shaves off another second or so. Maybe someone else can look into it further.
Edit: Just digging into this a little further. It doesn't look like fusion is happening. Even if I change sum
to a foldr (+) 0
which is an explicit "good producer/good consumer" pair.
Rec {
$wgo [InlPrag=[0], Occ=LoopBreaker] :: Int# -> Int#
[GblId, Arity=1, Str=DmdType <S,U>]
$wgo =
\ (w :: Int#) ->
let {
$j :: Int# -> Int#
[LclId, Arity=1, Str=DmdType]
$j =
\ (ww [OS=OneShot] :: Int#) ->
letrec {
$wgo1 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
[LclId, Arity=1, Str=DmdType <S,1*U>]
$wgo1 =
\ (w1 :: [Int]) ->
case w1 of _ [Occ=Dead] {
[] -> ww;
: y ys ->
case $wgo1 ys of ww1 { __DEFAULT ->
case lvl of _ [Occ=Dead] {
[] -> ww1;
: y1 ys1 ->
case y of _ [Occ=Dead] { I# y2 ->
case y1 of _ [Occ=Dead] { I# y3 ->
case tagToEnum# @ Bool (==# (+# w y2) y3) of _ [Occ=Dead] {
False ->
letrec {
$wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
[LclId, Arity=1, Str=DmdType <S,1*U>]
$wgo2 =
\ (w2 :: [Int]) ->
case w2 of _ [Occ=Dead] {
[] -> ww1;
: y4 ys2 ->
case y4 of _ [Occ=Dead] { I# y5 ->
case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
False -> $wgo2 ys2;
True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
}
}
}; } in
$wgo2 ys1;
True ->
letrec {
$wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
[LclId, Arity=1, Str=DmdType <S,1*U>]
$wgo2 =
\ (w2 :: [Int]) ->
case w2 of _ [Occ=Dead] {
[] -> ww1;
: y4 ys2 ->
case y4 of _ [Occ=Dead] { I# y5 ->
case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
False -> $wgo2 ys2;
True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
}
}
}; } in
case $wgo2 ys1 of ww2 { __DEFAULT -> +# 1 ww2 }
}
}
}
}
}
}; } in
$wgo1 lvl } in
case w of wild {
__DEFAULT -> case $wgo (+# wild 1) of ww { __DEFAULT -> $j ww };
1000 -> $j 0
}
end Rec }
In fact, looking at the core for print $ foldr (+) (0:: Int) $ [ i+j | i <- [0..10000], j <- [0..10000]]
it seems as though only the first layer of the list comprehension is fused. Is that a bug?
This code gets the job done in 1 second and no extra allocation in GHC 7.10 with -O2
(see the bottom for profiling output):
cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = fromEnum (i+j==k)
main = print $ sum [cmpsum i j k | i <- [1..1000],
j <- [1..const 1000 i],
k <- [1..const 1000 j]]
In GHC 7.8, you can get almost the same results in this case (1.4 seconds) if you add the following at the beginning:
import Prelude hiding (sum)
sum xs = foldr (\x r a -> a `seq` r (a+x)) id xs 0
There are three issues here:
Specializing the code to Int
instead of letting it default to Integer
is crucial.
GHC 7.10 offers list fusion for sum
that GHC 7.8 does not. This is because the new definition of sum
, based on a new definition of foldl
, can be very bad in some cases without the "call arity" analysis Joachim Breitner created for GHC 7.10.
GHC performs a limited "full laziness" pass very early in compilation, before any inlining occurs. As a result, the constant [1..1000]
terms for j
and k
, which are used multiple times in the loop, get hoisted out of the loop. This would be good if these were actually expensive to calculate, but in this context it's much cheaper to do the additions over and over and over instead of saving the results. What the code above does is trick GHC. Since const
isn't inlined until a little bit later, this first full laziness pass doesn't see that the lists are constant, so it doesn't hoist them out. I wrote it this way because it's nice and short, but it is, admittedly, a little on the fragile side. To make it more robust, use phased inlining:
main = print $ sum [cmpsum i j k | i <- [1..1000],
j <- [1..konst 1000 i],
k <- [1..konst 1000 j]]
{-# INLINE [1] konst #-}
konst = const
This guarantees that konst
will be inlined in simplifier phase 1, but no earlier. Phase 1 occurs after list fusion is complete, so it's perfectly safe to let GHC see everything then.
51,472 bytes allocated in the heap
3,408 bytes copied during GC
44,312 bytes maximum residency (1 sample(s))
17,128 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.071s ( 1.076s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.073s ( 1.077s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 48,059 bytes per MUT second
Productivity 99.9% of total user, 99.6% of total elapsed
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With