As far as I know, ! annotations (called bangs) are used to signal that an expression should be evaluated strictly. But it isn't obvious to me where to put them, or whether to put them at all.
The test program:
import qualified Data.Vector.Unboxed as V
main :: IO ()
main = print $ mean (V.enumFromTo 1 (10^9))
mean :: V.Vector Double -> Double
The different versions of mean:
-- compiled with O2 ~ 1.14s
mean xs = acc / fromIntegral len
  where
    !(len, acc) = V.foldl' f (0,0) xs :: (Int, Double)
    f (len, acc) x = (len+1, acc+x)

-- compiled with O2 ~ 1.18s
mean xs = acc / fromIntegral len
  where
    (!len, !acc) = V.foldl' f (0,0) xs :: (Int, Double)
    f (len, acc) x = (len+1, acc+x)

-- compiled with O2 ~ 1.75s
mean xs = acc / fromIntegral len
  where
    (len, acc) = V.foldl' f (0,0) xs :: (Int, Double)
    f !(len, acc) x = (len+1, acc+x)

-- compiled with O2 ~ 1.75s
mean xs = acc / fromIntegral len
  where
    (len, acc) = V.foldl' f (0,0) xs :: (Int, Double)
    f (len, acc) x = (len+1, acc+x)

-- compiled without options ~ 6s
mean xs = acc / fromIntegral len
  where
    (len, acc) = V.foldl' f (0,0) xs :: (Int, Double)
    f (len, acc) x = (len+1, acc+x)

-- compiled without options ~ 12s
mean xs = acc / fromIntegral len
  where
    !(len, acc) = V.foldl' f (0,0) xs :: (Int, Double)
    f (len, acc) x = (len+1, acc+x)
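For reference, a sketch of how such timings can be reproduced (the file name foldstrict.hs is an assumption here, taken from the ./foldstrict binary quoted in the answer below):

$ ghc -O2 foldstrict.hs
$ time ./foldstrict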
Some of it makes sense intuitively, but I'd like it to be less of a trial-and-error approach.
Is there some way to detect when lazy evaluation will get in the way of performance, other than rigorously testing every variant?
Does it only make sense for simple functions like mean, where everything should be evaluated in one go?
In your examples the bang patterns are moving around the final calculation of the average, or rather its ingredients:
where (!len, !acc) = V.foldl' f (0,0) xs :: (Int, Double)
where !(len, acc) = V.foldl' f (0,0) xs :: (Int, Double)
but (with one apparent exception) not the other ingredient, the folding function itself:
f (len, acc) x = (len+1, acc+x)
But this f is where the action is. In your examples, the different ways you annotate (len, acc) seem to be triggering the compiler to adopt subtly different views of what to do with f. This is why everything seems slightly occult. The thing to do is deal with f directly.
The main bread-and-butter point is that in a left fold or accumulating loop, all accumulated material must be evaluated strictly. Otherwise you are basically just building up a big expression with foldl', and then asking for it to be collapsed when you finally come to do something with the material you accumulated - here, with the final calculation of the average.
In your examples, though, foldl' is never given an explicitly strict function to fold with: the accumulating len and acc are trapped inside a regular lazy Haskell tuple.
The problem of strictness is arising here because you are accumulating more than one thing, but need to tie them together into one argument for the f operation you are passing to foldl'. This is a typical case for writing a strict type or record to do the accumulation; it takes one short line:
data P = P !Int !Double
then you can write
mean0 xs = acc / fromIntegral len
  where
    P len acc = V.foldl' f (P 0 0) xs
    f (P len acc) !x = P (len+1) (acc+x)
Note that I don't mark (P len acc) with a bang, since it is visibly in weak head normal form - you can see the P and don't need to ask the compiler to find it with !/seq - and thus f is strict in the first argument. The same is true in the one case where you add strictness to f:
f !(len, acc) x = (len+1, acc+x)
But the function

f (len, acc) x = (len+1, acc+x)

was already strict in its first argument, the pair, since you can see the outermost constructor (,), so it doesn't need a strictness annotation (which basically just tells the compiler to find that constructor). But the constructor only builds a lazy tuple, so it wasn't (explicitly) strict in len and acc.

With the strict pair type, on my machine I get:
$ time ./foldstrict
5.00000000067109e8
real 0m1.495s
whereas on my machine your best mean is:
$ time ./foldstrict
5.00000000067109e8
real 0m1.963s
Not set in stone, but the current best practice is to make all fields in data structures strict, while taking function arguments and returning results lazily (except for accumulators).
The net effect is that as long as you don't touch a piece of the return value, nothing is evaluated. As soon as you strictly need a tiny bit of it, the whole structure is evaluated at once, leading to more predictable memory/CPU usage patterns than if it were lazily evaluated throughout execution.
Johan Tibell's performance guidelines are the best at pointing out the subtleties: http://johantibell.com/files/haskell-performance-patterns.html#(1) . Note that recent GHCs unpack small strict fields automatically, without the need to annotate them. Also see the Strict and StrictData pragmas.
About when to introduce the strict fields: do it right from the beginning, since it's much harder to bolt it on retrospectively. You can still use lazy fields, but only when you explicitly want them.
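A sketch of the pragma route (not from the original answer): with StrictData every field declared in the module is strict unless you opt out with a tilde, and UNPACK can be added where you want small fields stored inline in the constructor:

{-# LANGUAGE StrictData #-}

-- With StrictData in effect, these fields are strict without explicit bangs,
-- equivalent to  data P = P !Int !Double.
data P = P Int Double

-- Explicit form with the small fields unpacked into the constructor.
data P' = P' {-# UNPACK #-} !Int {-# UNPACK #-} !Double

-- Opting back into a lazy field under StrictData.
data Cached = Cached Int ~Double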
Note: [] is lazy, and is used more as a control structure that is expected to inline than as a container. Use vector etc. for the latter.
Note 2: there are specialized libraries for strict folding (see the foldl package) and for streaming computations (conduit, pipes).
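For instance, a small sketch with the foldl package (meanF is a made-up name, and it takes a plain list here for simplicity): the two component folds are combined Applicatively, so the input is traversed only once and the accumulation is kept strict by the library.

import qualified Control.Foldl as L

-- Sum and length are computed in a single pass over the list.
meanF :: [Double] -> Double
meanF = L.fold ((/) <$> L.sum <*> L.genericLength)

For example, meanF [1..10] gives 5.5.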
A bit of elaboration on the rationale, so that 1) you know this isn't just a rule handed down from the sky, and 2) you know when and why to deviate.
One case is strict accumulation, as outlined in the question. This comes in less obvious forms too - such as keeping a count of certain events happening in the state of the app. If you don't store a strict count, you can get a long chain of +1 thunks building up, which consumes a lot of memory for no good reason (compared to storing just the updated count).
The above is informally called a memory leak, even if it's not technically a leak (no memory is lost, it is just held longer than needed).
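A minimal sketch of that counter case (the helper names are made up for illustration): with modifyIORef the counter just accumulates suspended (+1) applications, while modifyIORef' forces the new value on every update.

import Data.IORef

-- Lazy update: builds a chain of (+1) thunks that is only collapsed
-- when the counter is finally read.
countLazy :: IORef Int -> IO ()
countLazy ref = modifyIORef ref (+1)

-- Strict update: the new count is evaluated immediately, so the IORef
-- always holds a plain Int.
countStrict :: IORef Int -> IO ()
countStrict ref = modifyIORef' ref (+1)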
Another case is concurrent computation, where the work is divided across multiple threads. Now, it is easy to run into situations where you think you forked out a computation to a separate thread (making your program very efficiently concurrent), only to realize later that the concurrent thread only computes the outermost layer of a lazy data structure, and the bulk of the computation still happens on your main thread when the value is forced.
A solution path for this is using NFData from deepseq. But imagine having a final data structure layered as A (B (C)), where each layer is computed by a separate thread that deep-forces the structure before returning. Now C is deep-forced (in effect traversed in memory) three times and B twice. If C is a deep/large structure, this is a waste. At this point you can either add the Once trick, or just use a deeply strict data structure, where shallow forcing to WHNF (instead of to deep NF) has the same effect as deep forcing, and the Once trick is taken care of by the compiler, so to speak.
Now, if you are consistent and aware, you might be doing OK with deepseq + Once.
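As a sketch of the concurrent pitfall (hypothetical helper names; assuming the deepseq package is available): a forked thread that only evaluates to WHNF leaves almost all of the work to whoever later consumes the value, whereas forcing to normal form in the forked thread actually moves the work there.

import Control.Concurrent (ThreadId, forkIO)
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Control.Monad (void)

-- Only the outermost constructor is computed on the forked thread;
-- the bulk of a lazy structure is still evaluated by the consumer.
forkShallow :: a -> IO ThreadId
forkShallow x = forkIO (void (evaluate x))

-- The whole structure is computed on the forked thread before anyone
-- else touches it.
forkDeep :: NFData a => a -> IO ThreadId
forkDeep x = forkIO (void (evaluate (force x)))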
Note: a use case very similar to concurrent evaluation is single-threaded evaluation in the scary case of pure errors, such as undefined and error. These are ideally not used at all, but if they are, the ways to attack the problem are very similar to those outlined above (see, by the way, the spoon package).
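A minimal sketch of that situation (assuming the deepseq package): a pure error hiding inside a lazy structure is only hit when the structure is forced deeply, and deep forcing inside IO lets you catch it.

import Control.DeepSeq (force)
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  let xs = [1, 2, undefined] :: [Int]
  _ <- evaluate xs   -- WHNF only: the hidden error is not reached
  r <- try (evaluate (force xs)) :: IO (Either SomeException [Int])
  -- Deep forcing hits the undefined, which is caught here as an exception.
  putStrLn (either (const "pure error caught") show r)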