
Benefit of avoiding multiple list traversals

I've seen many examples in functional languages of processing a list once and constructing a function that does something with its elements once some additional value arrives (a value usually not available at the time the function was generated), such as:

  • Calculating the difference between each element and the average

    (the last 2 examples under "Lazy Evaluation"; a sketch of this idea appears right after this list)

  • Staging a list append in strict functional languages such as ML/OCaml, to avoid traversing the first list more than once

    (the section titled "Staging")

  • Comparing a list to another with foldr (i.e. generating a function to compare another list to the first)

    listEq a b = foldr comb null a b
      where comb x frec []     = False
            comb x frec (e:es) = x == e && frec es

    cmp1To10 = listEq [1..10]
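
Regarding the first example above (differences from the average), here is a minimal sketch of the idea in Haskell, assuming the point is to traverse the list once and return a function that waits for the average; the code and names are my own illustration, not the book's:

    -- Traverse xs once, building a function that waits for the average.
    -- Applying that function later yields the differences without
    -- pattern-matching the original list again.
    diffsFrom :: [Double] -> (Double -> [Double])
    diffsFrom = foldr (\x k avg -> (x - avg) : k avg) (const [])

    -- Hypothetical use:
    --   let xs = [1, 2, 3, 4]
    --       d  = diffsFrom xs                           -- stage 1: one traversal
    --   in  d (sum xs / fromIntegral (length xs))       -- [-1.5, -0.5, 0.5, 1.5]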

In all these examples, the authors generally remark on the benefit of traversing the original list only once. But I can't help thinking "sure, instead of traversing a list of N elements, you are traversing a chain of N evaluations, so what?". I know there must be some benefit to it; could someone explain it, please?


Edit: Thanks to both of you for the answers. Unfortunately, that's not what I wanted to know. I'll try to clarify my question so it's not confused with the (more common) one about creating intermediate lists (which I have already read about in various places). Also thanks for correcting my post formatting.

I'm interested in the cases where you construct a function to be applied to a list, but you don't yet have the value needed to evaluate the result (be it a list or not). Then you can't avoid holding a reference to each list element (even if the list structure itself is no longer referenced). And you have the same memory accesses as before, but you don't have to deconstruct the list (pattern matching).

For example, see the "staging" chapter in the mentioned ML book. I tried it in ML and Racket, more specifically the staged version of append, which traverses the first list and returns a function to insert the second list at the tail without traversing the first list again. Surprisingly for me, it was much faster, even considering it still had to copy the list structure, since the last pointer was different in each case.
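
For reference, here is a minimal sketch of the staged-append idea written in Haskell notation (my own transcription, not the book's ML code); in a strict language the first stage runs eagerly, so the returned closure never pattern-matches the first list again:

    -- Stage 1: traverse xs once, building a chain of closures.
    -- Stage 2: the returned function splices any ys onto the tail.
    appendS :: [a] -> ([a] -> [a])
    appendS = foldr (\x k ys -> x : k ys) id

    append123 :: [Int] -> [Int]
    append123 = appendS [1,2,3]   -- commit to the first list once

    -- append123 [4,5]  ==  [1,2,3,4,5]
    -- append123 [6]    ==  [1,2,3,6]

In Haskell, laziness means the closure chain is only built when the result is first demanded, which is consistent with the update below about not measuring a difference there.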

The following is a variant of map which, after being applied to a list, should be faster when changing the function. As Haskell is not strict, I would have to force the evaluation of listMap [1..100000] in cachedList (or maybe not, as after the first application it should still be in memory).

    listMap = foldr comb (const [])
      where comb x rest = \f -> f x : rest f

    cachedList = listMap [1..100000]
    doubles = cachedList (2*)
    squares = cachedList (\x -> x*x)

    -- print doubles and squares
    -- ...

I know in Haskell it doesn't make a difference (please correct me if I'm wrong) using comb x rest f = ... vs comb x rest = \f -> ..., but I chose this version to emphasize the idea.

Update: after some simple tests, I couldn't find any difference in execution times in Haskell. The question then is only about strict languages such as Scheme (at least the Racket implementation, where I tested it) and ML.

asked Dec 03 '12 by Alejandro Pulver


1 Answer

Executing a few extra arithmetic instructions in your loop body is cheaper than executing a few extra memory fetches, basically.

Traversals mean doing lots of memory accesses, so the fewer you do, the better. Fusing traversals reduces memory traffic and increases the straight-line compute load, so you get better performance.

Concretely, consider this program to compute some math on a list:

    go :: [Int] -> [Int]
    go = map (+2) . map (^3)

Clearly, we design it with two traversals of the list. Between the first and the second traversal, the result is stored in an intermediate data structure. However, it is a lazy structure, so it only costs O(1) memory.

Now, the Haskell compiler immediately fuses the two loops into:

    go = map ((+2) . (^3))
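
One way to picture this fusion is as a rewrite rule. The rule below is a hand-written illustration, not the actual rule in GHC's base library (GHC's list fusion really goes through build/foldr rules), but it expresses the same transformation:

    {-# RULES
    "map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
      #-}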

Why is that? After all, both are O(n) complexity, right? The difference is in the constant factors.

Consider this abstraction: for each step of the first pipeline we do:

    i <- read memory          -- cost M
    j = i ^ 3                 -- cost A
    write memory j            -- cost M
    k <- read memory          -- cost M
    l = k + 2                 -- cost A
    write memory l            -- cost M

so we pay 4 memory accesses, and 2 arithmetic operations.

For the fused result we have:

    i <- read memory          -- cost M
    j = (i ^ 3) + 2           -- cost 2A
    write memory j            -- cost M

where A and M are the constant factors for doing arithmetic on the ALU and for memory access, respectively.

There are other constant factors as well: two loop branches instead of one.

So unless memory access is free (and it is not, by a long shot), the second version is always faster.

Note that compilers that operate on immutable sequences can implement array fusion, the transformation that does this for you. GHC is such a compiler.
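
As a concrete illustration (my addition, assuming the vector package, not part of the original answer), Data.Vector's stream-fusion rewrite rules let GHC compile the two maps below into a single loop over an unboxed vector:

    import qualified Data.Vector.Unboxed as V

    -- With stream fusion, no intermediate vector is materialized between
    -- the two maps; GHC turns the composition into one pass.
    goV :: V.Vector Int -> V.Vector Int
    goV = V.map (+2) . V.map (^3)

    main :: IO ()
    main = print (V.sum (goV (V.enumFromTo 1 100000)))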

answered by Don Stewart