Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does adding INLINE slow down my program

Tags:

haskell

inline

I was looking into making a foldl that worked on infinite lists, for situations where you could not obtain guarded recursion but where depending the first argument the second argument might not be used.

For example multiplication, where usually you need both arguments and guarded recursion does not work, but if the first argument is 0 you can short circuit.

So I wrote the following function:

foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
    go b [] = b
    go b (x : xs) 
        | p b = go (f b x) xs
        | otherwise = b

And tested it with my custom short circuiting multiplication function:

 mult :: Integer -> Integer -> Integer
 mult 0 _ = 0
 mult x y = x * y

 main :: IO ()
 main = print . <test_function>

The results I got with -prof -fprof-auto -O2, +RTS -p were:

foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
total time = 0.40 secs
total alloc = 480,049,336 bytes

foldlp mult (`seq` True) 1 $ replicate (10 ^ 7) 1
total time = 0.37 secs
total alloc = 480,049,336 bytes

foldl' mult 1 $ replicate (10 ^ 7) 1
total time = 0.37 secs
total alloc = 480,049,352 bytes

foldl mult 1 $ replicate (10 ^ 7) 1
total time = 0.74 secs
total alloc = 880,049,352 bytes

foldr mult 1 $ replicate (10 ^ 7) 1
total time = 0.87 secs
total alloc = 880,049,336 bytes

Which was very promising, as my custom function allows for flexible types of strictness, and also works on infinite lists

The first example will terminate as soon as it hits a 0, as will foldr, but foldr is much slower.

It avoids issues such as thunks within tuples, seeing as ((1 + 2) + 3, (10 + 20) + 30) is technically in WHNF, breaking foldl'.

You can re-obtain foldl with flip foldl (const True) and foldl' with flip foldl (seqTrue). And the performance characteristics of the original restricted functions seem to be re-obtained by doing so.

So as a side note I think foldlp would be a worthwhile addition to Foldable.

But my actual question was why when I added {-# INLINE foldlp #-} the functions performance went down significantly, giving me:

foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
total time = 0.67 secs
total alloc = 800,049,336 bytes

So my real question is why that is happening. I thought the downside of inlining was code bloat, not significant negative effects on runtime performance and increased memory usage.

like image 686
semicolon Avatar asked Oct 12 '16 05:10

semicolon


1 Answers

According to the GHC docs the INLINE pragma prevents other compiler optimizations in order to still let rewrite rules take effect.

So my guess would be, that by using INLINE you remove some optimization, that GHC would have applied to make your code faster.

After some poking around in the core (use -ddump-simpl in compilation) I found the optimization, that GHC performs. For this I took a look at the core for foldlp with inlining and without inlining:

Inlined:

foldlp =
  \ (@ b_a10N)
    (@ a_a10O)
    (eta_B2 :: b_a10N -> a_a10O -> b_a10N)
    (eta1_B1 :: b_a10N -> Bool)
    (eta2_X3 :: b_a10N)
    (eta3_X5 :: [a_a10O]) ->
    letrec {
      go_s1Ao [Occ=LoopBreaker] :: b_a10N -> [a_a10O] -> b_a10N
      [LclId, Arity=2, Str=DmdType <L,U><S,1*U>]
      go_s1Ao =
        \ (b1_avT :: b_a10N) (ds_d1xQ :: [a_a10O]) ->
        -- Removed the actual definition of go for brevity,
        -- it's the same in both cases
          }; } in
    go_s1Ao eta2_X3 eta3_X5

Non-inlined:

foldlp =
  \ (@ b_a10N)
    (@ a_a10O)
    (f_avQ :: b_a10N -> a_a10O -> b_a10N)
    (p_avR :: b_a10N -> Bool) ->
    letrec {
      go_s1Am [Occ=LoopBreaker] :: b_a10N -> [a_a10O] -> b_a10N
      [LclId, Arity=2, Str=DmdType <L,U><S,1*U>]
      go_s1Am =
        \ (b1_avT :: b_a10N) (ds_d1xQ :: [a_a10O]) ->
        -- Removed the actual definition of go for brevity,
        -- it's the same in both cases
          }; } in
    go_s1Am

The relevant difference is in the very last line. The optimizer takes away the step of actually having to call foldlp to call go and just makes a function with two arguments out of foldlp that returns a function with two arguments. With inlining this optimization is not performed and the core looks exactly like the code you wrote.

I verified this by writing three variants of foldlp:

module Main where

foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
      go b [] = b
      go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

{-# INLINE foldlpInline #-}
foldlpInline :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpInline f p = go where
      go b [] = b
      go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b


{-# INLINE foldlp' #-} -- So that the code is not optimized
foldlp' b [] = b
foldlp' b (x : xs)
        | (/= 0) b = foldlp' (mult b x) xs
        | otherwise = b

mult :: Integer -> Integer -> Integer
mult 0 _ = 0
mult x y = x * y

--main = print $ foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
--main = print $ foldlpInline mult (/= 0) 1 $ replicate (10 ^ 7) 1
main = print $ foldlp' 1 $ replicate (10 ^ 7) 1

The results are:

First case (normal non-inlined):

./test  0,42s user 0,01s system 96% cpu 0,446 total

Second case (inlined):

./test  0,83s user 0,02s system 98% cpu 0,862 total

Third case (what the compiler produces for non-inlined)

./test  0,42s user 0,01s system 99% cpu 0,432 total
like image 72
Christof Schramm Avatar answered Oct 29 '22 23:10

Christof Schramm