I was looking into making a foldl that works on infinite lists, for situations where you cannot obtain guarded recursion but where, depending on the first argument, the second argument might not be needed at all. Multiplication is an example: usually you need both arguments, so guarded recursion does not work, but if the first argument is 0 you can short-circuit.
So I wrote the following function:
foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
  go b [] = b
  go b (x : xs)
    | p b = go (f b x) xs
    | otherwise = b
And tested it with my custom short-circuiting multiplication function:
mult :: Integer -> Integer -> Integer
mult 0 _ = 0
mult x y = x * y
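As a small aside, the first clause of mult never looks at its second argument, which is exactly what makes short-circuiting possible. A quick check (my own hypothetical example, assuming the mult above is in scope):

-- The 0 case pattern-matches only on the first argument,
-- so the second argument is never forced:
shortCircuits :: Bool
shortCircuits = mult 0 undefined == 0  -- True; undefined is never touched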
main :: IO ()
main = print . <test_function>
The results I got (compiled with -prof -fprof-auto -O2, run with +RTS -p) were:
foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
total time = 0.40 secs
total alloc = 480,049,336 bytes
foldlp mult (`seq` True) 1 $ replicate (10 ^ 7) 1
total time = 0.37 secs
total alloc = 480,049,336 bytes
foldl' mult 1 $ replicate (10 ^ 7) 1
total time = 0.37 secs
total alloc = 480,049,352 bytes
foldl mult 1 $ replicate (10 ^ 7) 1
total time = 0.74 secs
total alloc = 880,049,352 bytes
foldr mult 1 $ replicate (10 ^ 7) 1
total time = 0.87 secs
total alloc = 880,049,336 bytes
This was very promising: my custom function allows for flexible kinds of strictness and also works on infinite lists. The first example terminates as soon as it hits a 0, as does foldr, but foldr is much slower. It also avoids problems such as thunks hiding inside tuples, since ((1 + 2) + 3, (10 + 20) + 30) is technically already in WHNF, which defeats foldl'.
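To make the infinite-list and tuple points concrete, here is a small sketch (my own examples, with hypothetical names, assuming the foldlp and mult definitions above are in scope):

-- foldlp stops as soon as the predicate fails, so it can consume an infinite
-- list provided a 0 shows up at some point:
stopsAtZero :: Integer
stopsAtZero = foldlp mult (/= 0) 1 (2 : 3 : 0 : repeat 1)  -- evaluates to 0

-- The predicate can also force inside a tuple accumulator, something plain seq
-- (and therefore foldl') does not do, because the tuple is already in WHNF:
sumAndProduct :: (Integer, Integer)
sumAndProduct =
  foldlp (\(s, p) x -> (s + x, p * x))
         (\(s, p) -> s `seq` p `seq` True)
         (0, 1)
         [1 .. 10000]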
You can re-obtain foldl with flip foldlp (const True) and foldl' with flip foldlp (`seq` True), and doing so seems to recover the performance characteristics of the original, more restricted functions.
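Spelled out as definitions, that looks like this (a sketch with names of my own choosing, assuming the foldlp above is in scope):

-- Recovering foldl: the predicate always succeeds and never forces the accumulator.
foldlViaP :: (b -> a -> b) -> b -> [a] -> b
foldlViaP f = foldlp f (const True)

-- Recovering foldl': the predicate forces the accumulator to WHNF at every step.
foldlViaP' :: (b -> a -> b) -> b -> [a] -> b
foldlViaP' f = foldlp f (`seq` True)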
So as a side note, I think foldlp would be a worthwhile addition to Foldable.
But my actual question is why, when I added {-# INLINE foldlp #-}, the function's performance went down significantly, giving me:
foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
total time = 0.67 secs
total alloc = 800,049,336 bytes
So my real question is: why is that happening? I thought the downside of inlining was code bloat, not a significant hit to runtime performance along with increased memory usage.
According to the GHC docs, the INLINE pragma prevents certain other compiler optimizations so that rewrite rules can still take effect. So my guess is that by using INLINE you lose some optimization that GHC would otherwise have applied to make your code faster.
After some poking around in the core (pass -ddump-simpl to GHC when compiling), I found the optimization that GHC performs. To see it, I compared the core for foldlp with inlining and without inlining:
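For reference, a typical way to produce such a dump is an invocation along these lines (standard GHC flags; -dsuppress-all just strips much of the noise from the output):

ghc -O2 -ddump-simpl -dsuppress-all Main.hs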
Inlined:
foldlp =
\ (@ b_a10N)
(@ a_a10O)
(eta_B2 :: b_a10N -> a_a10O -> b_a10N)
(eta1_B1 :: b_a10N -> Bool)
(eta2_X3 :: b_a10N)
(eta3_X5 :: [a_a10O]) ->
letrec {
go_s1Ao [Occ=LoopBreaker] :: b_a10N -> [a_a10O] -> b_a10N
[LclId, Arity=2, Str=DmdType <L,U><S,1*U>]
go_s1Ao =
\ (b1_avT :: b_a10N) (ds_d1xQ :: [a_a10O]) ->
-- Removed the actual definition of go for brevity,
-- it's the same in both cases
}; } in
go_s1Ao eta2_X3 eta3_X5
Non-inlined:
foldlp =
\ (@ b_a10N)
(@ a_a10O)
(f_avQ :: b_a10N -> a_a10O -> b_a10N)
(p_avR :: b_a10N -> Bool) ->
letrec {
go_s1Am [Occ=LoopBreaker] :: b_a10N -> [a_a10O] -> b_a10N
[LclId, Arity=2, Str=DmdType <L,U><S,1*U>]
go_s1Am =
\ (b1_avT :: b_a10N) (ds_d1xQ :: [a_a10O]) ->
-- Removed the actual definition of go for brevity,
-- it's the same in both cases
}; } in
go_s1Am
The relevant difference is in the very last line. Without inlining, the optimizer removes the step of having to go through foldlp just to reach go: foldlp becomes a two-argument function, taking f and p, that returns go, which is itself a two-argument function. With INLINE this optimization is not performed; foldlp takes all four of its arguments before building and entering go, essentially behaving like the code as you wrote it.
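In Haskell terms, the two shapes look roughly like this (my paraphrase of the core above, with made-up names; at -O2 GHC may of course rewrite these into each other, so they are only meant to illustrate the difference, not reproduce it):

-- Shape of the non-inlined core: foldlp takes f and p and hands back the
-- worker go directly, so the caller drops straight into the loop.
foldlpShapeA :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpShapeA f p = go
  where
    go b [] = b
    go b (x : xs)
      | p b = go (f b x) xs
      | otherwise = b

-- Shape of the INLINE'd core: foldlp is eta-expanded to take all four
-- arguments before building and entering go.
foldlpShapeB :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpShapeB f p z xs = go z xs
  where
    go b [] = b
    go b (y : ys)
      | p b = go (f b y) ys
      | otherwise = b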
I verified this by writing three variants of foldlp:
module Main where

foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
  go b [] = b
  go b (x : xs)
    | p b = go (f b x) xs
    | otherwise = b

{-# INLINE foldlpInline #-}
foldlpInline :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpInline f p = go where
  go b [] = b
  go b (x : xs)
    | p b = go (f b x) xs
    | otherwise = b

{-# INLINE foldlp' #-} -- So that the code is not optimized
foldlp' :: Integer -> [Integer] -> Integer
foldlp' b [] = b
foldlp' b (x : xs)
  | (/= 0) b = foldlp' (mult b x) xs
  | otherwise = b

mult :: Integer -> Integer -> Integer
mult 0 _ = 0
mult x y = x * y

--main = print $ foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
--main = print $ foldlpInline mult (/= 0) 1 $ replicate (10 ^ 7) 1
main = print $ foldlp' 1 $ replicate (10 ^ 7) 1
The results are:
First case (normal non-inlined):
./test 0,42s user 0,01s system 96% cpu 0,446 total
Second case (inlined):
./test 0,83s user 0,02s system 98% cpu 0,862 total
Third case (what the compiler produces for non-inlined):
./test 0,42s user 0,01s system 99% cpu 0,432 total