I was working on a Project Euler problem and ended up with a Haskell file that included a function that looked like this:
matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr (\(cs', n) a -> fromBool (f cs cs') * n + a) 0
With fromBool
imported from Foreign.Marshal.Utils
just to quickly convert True
to 1
and False
to 0
.
I was trying to get a little more speed out of my solution so I tried switching from foldr
to foldl'
(switching the arguments in the process) as I assumed foldr
didn't make much sense to use on numbers.
Switching from foldr
to foldl'
caused me to allocate more than twice as much memory according to GHC's profiler.
For fun I also decided to replace the lambda with a pointfree version of the function:
matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr ((+) . uncurry ((*) . fromBool . f cs)) 0
This caused my memory allocation to increase 20x from the foldr
version.
Now this isn't a huge deal as even in the 20x case the total memory allocation was only about 135Mb
and the runtime of the program was relatively unaffected, if anything the higher memory allocation versions ran slightly faster.
But I am really curious as to how these results could be possible, so that in future I will be able to pick the "right" function when I don't have as much leeway.
EDIT:
GHC version 7.10.2, compiled with -O2 -prof -fprof-auto
. Executed with +RTS -p
.
EDIT 2:
Alright it looks like this is too difficult to reproduce to omit the rest of the code, well here is the entire program:
SPOILERS BELOW:
{-# LANGUAGE NoMonomorphismRestriction #-}
import Control.Monad
import Data.List
import Foreign.Marshal.Utils
data Color = Red | Green | Blue deriving (Eq, Enum, Bounded, Show)
colors :: [Color]
colors = [Red ..]
matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f x = foldr ((+) . uncurry ((*) . fromBool . f x)) 0
-- matches f x = foldr (\(y, n) a -> fromBool (f x y) * n + a) 0
-- matches f x = foldl' (\a (y, n) -> fromBool (f x y) * n + a) 0
invert :: [([Color], Int)] -> [([Color], Int)]
invert rs = (\cs -> (cs, matches valid cs rs)) <$> choices
where
len = maximum $ length . fst <$> rs
choices = replicateM len colors
valid (x : xs) (y : ys) = x /= y && valid xs ys
valid _ _ = True
expand :: [([Color], Int)] -> [([Color], Int)]
expand rs = (\cs -> (cs, matches valid cs rs)) <$> choices
where
len = maximum $ length . fst <$> rs
choices = replicateM (len + 1) colors
valid (x1 : x2 : xs) (y : ys) = x1 /= y && x2 /= y && valid (x2 : xs) ys
valid _ _ = True
getRow :: Int -> [([Color], Int)]
getRow 1 = flip (,) 1 . pure <$> colors
getRow n = expand . invert $ getRow (n - 1)
result :: Int -> Int
result n = sum $ snd <$> getRow n
main :: IO ()
main = print $ result 8
Note: This post is written in literate Haskell. Copy it into a file, save it as *.lhs, and compile/load with in GHC(i). Also, I started writing this answer before you've edited your code, but the lesson stays the same.
The Prelude
function uncurry
is too lazy, whereas your pattern match is just strict enough.
We're entering a magical, weird place. Beware. Also, my CORE abilities are rudimentary. Now that I've lost all my credibility, let's get started.
In order to know where we get the additional memory requirements, it's useful to have more than two functions.
> import Control.Monad (forM_)
This is your original, non-pointfree variant:
> matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matches f cs = foldr (\(cs', n) a -> fromEnum (f cs cs') * n + a) 0
This is a variant that's only slightly point-free, the parameter a
is eta-reduced.
> matchesPF' :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF' f cs = foldr (\(cs', n) -> (+) (fromEnum (f cs cs') * n)) 0
This is a variant that inlines uncurry
by hand.
> matchesPFI :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFI f cs = foldr ((+) . (\(cs', n) -> fromEnum (f cs cs') * n)) 0
This your pointfree version.
> matchesPF :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF f cs = foldr ((+) . uncurry ((*) . fromEnum . f cs)) 0
This is a variant that uses a custom uncurry
, see below.
> matchesPFU :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFU f cs = foldr ((+) . uncurryI ((*) . fromEnum . f cs)) 0
This is a variant that uses a custom lazy uncurry
, see below.
> matchesPFL :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFL f cs = foldr ((+) . uncurryL ((*) . fromEnum . f cs)) 0
To test the functions easily, we use a list:
> funcs = [matches, matchesPF', matchesPF, matchesPFL, matchesPFU, matchesPFI]
Our self-written uncurry
:
> uncurryI :: (a -> b -> c) -> (a, b) -> c
> uncurryI f (a,b) = f a b
A lazier uncurry
:
> uncurryL :: (a -> b -> c) -> (a, b) -> c
> uncurryL f p = f (fst p) (snd p)
The lazy variant uncurryL
has the same semantics as the variant in Prelude
, e.g.
uncurry (\_ _ -> 0) undefined == 0 == uncurryL (\_ _ -> 0) undefined
whereas uncurryI
is strict in the pair's spine.
> main = do
> let f a b = a < b
> forM_ [1..10] $ \i ->
> forM_ funcs $ \m ->
> print $ m f i (zip (cycle [1..10]) [1..i*100000])
The list [1..i*100000]
depends on i
deliberately, so that we don't introduce a CAF and skew our allocation profile.
Before we delve into the profile, let's have a look at the desugared code of each function:
==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
= {terms: 221, types: 419, coercions: 0}
uncurryL
uncurryL = \ @ a @ b @ c f p -> f (fst p) (snd p)
uncurryI
uncurryI = \ @ a @ b @ c f ds -> case ds of _ { (a, b) -> f a b }
-- uncurried inlined by hand
matchesPFI =
\ @ a f cs ->
foldr
$fFoldable[]
(. (+ $fNumInt)
(\ ds ->
case ds of _ { (cs', n) ->
* $fNumInt (fromEnum $fEnumBool (f cs cs')) n
}))
(I# 0)
-- lazy uncurry
matchesPFL =
\ @ a f cs ->
foldr
$fFoldable[]
(. (+ $fNumInt)
(uncurryL (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
(I# 0)
-- stricter uncurry
matchesPFU =
\ @ a f cs ->
foldr
$fFoldable[]
(. (+ $fNumInt)
(uncurryI (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
(I# 0)
-- normal uncurry
matchesPF =
\ @ a f cs ->
foldr
$fFoldable[]
(. (+ $fNumInt)
(uncurry (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
(I# 0)
-- eta-reduced a
matchesPF' =
\ @ a f cs ->
foldr
$fFoldable[]
(\ ds ->
case ds of _ { (cs', n) ->
+ $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n)
})
(I# 0)
-- non-point-free
matches =
\ @ a f cs ->
foldr
$fFoldable[]
(\ ds a ->
case ds of _ { (cs', n) ->
+ $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n) a
})
(I# 0)
So far, everything seems well. There's nothing surprising going on. Typeclass functions are replaced with their dictionary variants, e.g. foldr
becomesfoldr $fFoldable[]`, since we call it on a list.
Mon Jul 18 15:47 2016 Time and Allocation Profiling Report (Final) Prof +RTS -s -p -RTS total time = 1.45 secs (1446 ticks @ 1000 us, 1 processor) total alloc = 1,144,197,200 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc matchesPF' Main 13.6 0.0 matchesPF Main 13.3 11.5 main.\.\ Main 11.8 76.9 main.f Main 10.9 0.0 uncurryL Main 9.5 11.5 matchesPFU Main 8.9 0.0 matchesPFI Main 7.3 0.0 matches Main 6.9 0.0 matchesPFL Main 6.3 0.0 uncurryI Main 5.3 0.0 matchesPF'.\ Main 2.6 0.0 matchesPFI.\ Main 2.0 0.0 matches.\ Main 1.5 0.0 individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 44 0 0.0 0.0 100.0 100.0 main Main 89 0 0.0 0.0 100.0 100.0 main.\ Main 90 10 0.0 0.0 100.0 100.0 main.\.\ Main 92 60 11.8 76.9 100.0 100.0 funcs Main 93 0 0.0 0.0 88.2 23.1 matchesPFI Main 110 10 7.3 0.0 11.7 0.0 matchesPFI.\ Main 111 5500000 2.0 0.0 4.4 0.0 main.f Main 112 5500000 2.4 0.0 2.4 0.0 matchesPFU Main 107 10 8.9 0.0 15.3 0.0 uncurryI Main 108 5500000 5.3 0.0 6.4 0.0 main.f Main 109 5500000 1.1 0.0 1.1 0.0 matchesPFL Main 104 10 6.3 0.0 17.7 11.5 uncurryL Main 105 5500000 9.5 11.5 11.4 11.5 main.f Main 106 5500000 1.9 0.0 1.9 0.0 matchesPF Main 102 10 13.3 11.5 15.4 11.5 main.f Main 103 5500000 2.1 0.0 2.1 0.0 matchesPF' Main 99 10 13.6 0.0 17.2 0.0 matchesPF'.\ Main 100 5500000 2.6 0.0 3.6 0.0 main.f Main 101 5500000 1.0 0.0 1.0 0.0 matches Main 94 10 6.9 0.0 10.9 0.0 matches.\ Main 97 5500000 1.5 0.0 4.0 0.0 main.f Main 98 5500000 2.5 0.0 2.5 0.0 CAF Main 87 0 0.0 0.0 0.0 0.0 funcs Main 91 1 0.0 0.0 0.0 0.0 main Main 88 1 0.0 0.0 0.0 0.0 main.\ Main 95 0 0.0 0.0 0.0 0.0 main.\.\ Main 96 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.FD 84 0 0.0 0.0 0.0 0.0 CAF GHC.Conc.Signal 78 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding 76 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.Text 75 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding.Iconv 59 0 0.0 0.0 0.0 0.0
Ignore the main\.\.
noise, it's just the list. However, there's one point that one should notice immediately: matchesPF
and uncurryL
use the same alloc%
:
matchesPF Main 13.3 11.5
uncurryL Main 9.5 11.5
Now it's time to inspect the resulting CORE (ghc -ddump-simpl
). We'll notice that most of the functions have been transformed into worker wrappers, and they look more or less the same (-dsuppress-all -dsuppress-uniques
):
$wa5
$wa5 =
\ @ a1 w w1 w2 ->
letrec {
$wgo
$wgo =
\ w3 ->
case w3 of _ {
[] -> 0;
: y ys ->
case y of _ { (cs', n) ->
case $wgo ys of ww { __DEFAULT ->
case w w1 cs' of _ {
False -> case n of _ { I# y1 -> ww };
True -> case n of _ { I# y1 -> +# y1 ww }
}
}
}
}; } in
$wgo w2
This is your usual worker-wrapper. $wgo
takes a list, checks whether it's empty, is strict in the head (case y of _ { (cs', n) ->…
) and lazy in the recursive result $wgo ys of ww
.
All functions look the same. Well, all except matchesPF
(your variant)
-- matchesPF
$wa3 =
\ @ a1 w w1 w2 ->
letrec {
$wgo =
\ w3 ->
case w3 of _ {
[] -> 0;
: y ys ->
case $wgo ys of ww { __DEFAULT ->
case let {
x = case y of _ { (x1, ds) -> x1 } } in
case w w1 x of _ {
False ->
case y of _ { (ds, y1) -> case y1 of _ { I# y2 -> main13 } };
-- main13 is just #I 0
True -> case y of _ { (ds, y1) -> y1 }
}
of _ { I# x ->
+# x ww
}
}
}; } in
$wgo w2
and matchesPFL
(the variant that uses the lazy uncurryL
)
-- matchesPFL
$wa2
$wa2 =
\ @ a1 w w1 w2 ->
letrec {
$wgo =
\ w3 ->
case w3 of _ {
[] -> 0;
: y ys ->
case $wgo ys of ww { __DEFAULT ->
case snd y of ww1 { I# ww2 ->
case let {
x = fst y } in
case w w1 x of _ {
False -> main13;
True -> ww1
}
of _ { I# x ->
+# x ww
}
}
}
}; } in
$wgo w2
They are virtually the same. And both of them contain let
bindings. This will create a thunk and usually lead to worse space requirements.
I think the culprit at this point is clear. It is uncurry
. GHC wants to enforce the correct semantics of
uncurry (const (const 0)) undefined
However, this adds laziness and additional thunks. Your non-pointfree variant doesn't introduce that behaviour, since you pattern match on the pair:
foldr (\(cs', n) a -> …)
Still don't trust me? Use a lazy pattern match
foldr (\ ~(cs', n) a -> …)
and you will notice that matches
will behave the same as matchesPF
. So use slightly stricter variant of uncurry
. uncurryI
is enough to give the strictness analyzer a hint.
Note that pairs are notorious for this behaviour. RWH spents a whole chapter trying to optimize the behaviour of a single function where intermediate pairs lead to problems.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With