 

Why is the lazy pattern match version of the splitAt function faster?

The splitAt function can be implemented as follows (https://wiki.haskell.org/Lazy_pattern_match):

import Prelude hiding (splitAt)
splitAt :: Int -> [a] -> ([a], [a])
splitAt n xs =
  if n<=0
     then ([], xs)
     else
        case xs of
          [] -> ([], [])
          y:ys ->
            case splitAt (n-1) ys of
              ~(prefix, suffix) -> (y : prefix, suffix) -- Here using lazy pattern match

main :: IO ()
main = do
  let r = splitAt 1000000 $ repeat 'a'
  print $ length $ fst r

Using a strict pattern match instead slows the program down considerably.

time ./lazy     -- 0.04s user 0.00s system 90% cpu 0.047 total
time ./strict   -- 0.12s user 0.02s system 96% cpu 0.147 total
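(For reference, the strict version is presumably the same definition with the ~ removed, so the recursive result is matched eagerly:)

import Prelude hiding (splitAt)

splitAt :: Int -> [a] -> ([a], [a])
splitAt n xs =
  if n<=0
     then ([], xs)
     else
        case xs of
          [] -> ([], [])
          y:ys ->
            case splitAt (n-1) ys of
              (prefix, suffix) -> (y : prefix, suffix) -- no ~: forces the recursive call before returning

main :: IO ()
main = do
  let r = splitAt 1000000 $ repeat 'a'
  print $ length $ fst r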

I can't see why. According to the article, the strict version may need more memory, and all the recursive calls have to run before the pattern can be checked. But the lazy version also needs all the recursive calls, and memory to hold their results. What makes the difference?

asked Feb 10 '17 by hliu

1 Answer

There are a bunch of differences.

Let's look at the operational difference between the variants with and without the ~ on the result of the recursive call.

Evaluation in GHC Haskell is driven by pattern-matching. When a pattern is matched in a case expression or the LHS of a function definition, it requires that the constructors in the pattern be evaluated. (Patterns in let and where bindings are treated as lazy pattern matches.) So that means evaluating splitAt 1000000 (repeat 'a') depends on matching the (,) constructor resulting from the recursive call to splitAt 999999 ..., and so on, all the way down to the final call of splitAt 0 .... This requires stack space to evaluate. Quite a lot of it, in fact. It's very possible the stack has to be grown several times to avoid crashing.
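Here's a tiny self-contained illustration of that difference (a sketch of my own, not code from the question; swap' is a made-up name):

-- With the ~, the pair on the right-hand side is built immediately and
-- the argument is matched only if x or y is later demanded. Deleting
-- the ~ makes this program crash: the match would force undefined.
swap' :: (a, b) -> (b, a)
swap' ~(x, y) = (y, x)

main :: IO ()
main = case swap' undefined of
  (_, _) -> putStrLn "got a pair constructor without forcing the argument"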

In addition, the entire result string "aaaaa..." is built on the heap while this is happening, before length ever starts to process it. (Due to optimizations in repeat, the second element of the result is actually a circularly linked list which never allocates anything new in the whole recursive evaluation.)
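(For reference, repeat is defined in base essentially as a knot-tied list, modulo fusion plumbing, which is why walking it allocates nothing:)

-- The tail of the single (:) cell points back at the cell itself.
repeat' :: a -> [a]
repeat' x = xs where xs = x : xs

main :: IO ()
main = print (take 5 (repeat' 'a'))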

When the pattern match is made lazy, things change. The return value from splitAt 1000000 (repeat 'a') is evaluated to be ('a':_thunk1, _thunk2) without ever making a recursive call to splitAt. This is a pattern known as guarded corecursion. The further evaluation is hidden behind data constructors like (,) and (:), and so only performed if demanded by another case expression.
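You can watch this demand-driven behaviour directly by instrumenting the lazy version with Debug.Trace (the trace call is my addition, purely for illustration):

import Debug.Trace (trace)
import Prelude hiding (splitAt)

splitAt :: Int -> [a] -> ([a], [a])
splitAt n xs =
  if n<=0
     then ([], xs)
     else
        case xs of
          [] -> ([], [])
          y:ys ->
            case trace ("recursing at n = " ++ show n) (splitAt (n-1) ys) of
              ~(prefix, suffix) -> (y : prefix, suffix)

main :: IO ()
main = do
  let (p, _) = splitAt 5 (repeat 'a')
  print (head p)   -- prints 'a' with no trace output: no recursive call yet
  print (take 3 p) -- now traces appear, one per extra element demanded

The first print produces no trace line because the outer (:) is available before any recursive call is made; each further element of the prefix forces exactly one more unfolding.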

The call to fst throws away _thunk2, so it won't ever be evaluated. The call to length starts by consuming the first (:) constructor, throwing out the 'a' value, and then making a recursive call on _thunk1. At this point, nothing in memory is still referring to the (:) constructor, so the garbage collector is free to reclaim it when it runs next. (The 'a' value is shared, so there are still pointers to it, so it neither gets collected nor allocated during this process.)

What happens when _thunk1 is evaluated is a bit subtle. It makes the recursive call to splitAt 999999 .... It results in ('a':_thunk3, _thunk4). Nothing is holding on to _thunk4, so it's free to be garbage collected whenever. Evaluation of length proceeds as above. The (:) constructor is no longer held in memory, and is free to be collected.

Evaluation proceeds that way, only ever holding on to a single (:) constructor on the heap at a time, and without burning any stack space at all. And GHC's garbage collector's run time depends on the size of the resident set. As there is at most one (:) constructor resident, it goes really fast in this case.

I suspect in this case, that's the speed difference you're seeing. You should try running the two programs with the argument +RTS -s and looking at stats regarding maximum resident size and garbage collector run time.
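For example (the field names below are as they appear in GHC's -s summary):

./lazy +RTS -s     -- compare "maximum residency" and "GC time" in the output
./strict +RTS -s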

It's possible GHC is being really clever with optimizations, though. I haven't checked, but I know that in some cases it can rewrite algorithms that apply (:) explicitly into uses of the build function. If it did that here, foldr/build fusion could remove the allocation of the (:) constructors entirely. (Yes, length is defined in terms of foldr via some really cool tricks, mostly so that foldr/build fusion can work.)
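To sketch the idea (build' is a local stand-in; the real build and the rewrite rule live in GHC.Base, and no fusion actually fires in this sketch without that RULES pragma):

{-# LANGUAGE RankNTypes #-}

-- build abstracts a producer over its (:) and [], so a consumer's foldr
-- can be substituted for them, eliminating the intermediate list via
-- the rule:  foldr k z (build g) = g k z
build' :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build' g = g (:) []

-- A producer written against build' never mentions (:) directly.
replicate' :: Int -> a -> [a]
replicate' n x = build' (\cons nil ->
  let go i = if i <= 0 then nil else x `cons` go (i - 1)
  in go n)

main :: IO ()
main = print (length (replicate' 1000000 'a'))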

If that's the case, you'd find even less allocation taking place in the lazy case, but the strict case would be just as bad. I don't think that's likely to be happening here, but I'm not sure, as I haven't tested it.

answered Oct 18 '22 by Carl