Interleaving list functions

Question

Lets say I'm given two functions:

f :: [a] -> b
g :: [a] -> c

I want to write a function that is the equivalent of this:

h x = (f x, g x)

But when I do that, for large lists inevitably I run out of memory.

A simple example is the following:

x = [1..100000000::Int] 
main = print $ (sum x, product x)

I understand this is the case because the list x is being stored in memory without being garbage collected. It would be better instead of f and g worked on x in, well, "parallel".

Assuming I can't change f and g, nor want to make a separate copy of x (assume x is expensive to produce) how can I write h without running into out of memory issues?

Petr · Accepted Answer

A short answer is you can't. Since you have no control over f and g, you have no guarantee that the functions process their input sequentially. Such a function can as well keep the whole list stored in memory before producing the final result.

However, if your functions are expressed as folds, the situation is different. This means that we know how to incrementally apply each step, so we can parallelize those steps in one run.

The are many resources about this area. For example:

Haskell: Can I perform several folds over the same lazy list without keeping list in memory?
Classic Beautiful folding
More beautiful fold zipping

The pattern of consuming a sequence of values with properly defined space bounds is solved more generally with pipe-like libraries such conduit, iteratees or pipes. For example, in conduit, you could express the combination of computing sums and products as

import Control.Monad.Identity
import Data.Conduit
import Data.Conduit.List (fold, sourceList)
import Data.Conduit.Internal (zipSinks)

product', sum' :: (Monad m, Num a) => Sink a m a
sum'     = fold (+) 0
product' = fold (*) 1

main = print . runIdentity $ sourceList (replicate (10^6) 1) $$
                                zipSinks sum' product'

Will Ness · Answer

If you can turn your functions into folds, you can then just use them with a scan:

x = [1..100000000::Int] 
main = mapM_ print . tail . scanl foo (a0,b0) . takeWhile (not.null)  
         . unfoldr (Just . splitAt 1000)  -- adjust the chunk length as needed
         $ x

foo (a,b) x = let a2 = f' a $ f x ; b2 = g' b $ g x
              in a2 `seq` b2 `seq` (a2, b2)

f :: [t] -> a         -- e.g. sum
g :: [t] -> b         --      (`rem` 10007) . product
f' :: a -> a -> a     -- e.g. (+)
g' :: b -> b -> b     --      ((`rem` 10007) .) . (*)

we consume the input in chunks for better performance. Compiled with -O2, this should run in a constant space. The interim results are printed as indication of progress.

If you can't turn your function into a fold, this means it has to consume the whole list to produce any output and this trick doesn't apply.

Don Stewart · Answer

You can use multiple threads to evaluate f x and g x in parallel.

E.g.

x :: [Int]
x = [1..10^8]

main = print $ let a = sum x
                   b = product x
               in a `par` b `pseq` (a,b)

Its a nice way to exploit GHC's parallel runtime to prevent a space leak by doing two things at once.

Alternatively, you need to fuse f and g into a single pass.

Interleaving list functions

Tags:

haskell

Clinton

3 Answers

Petr

Will Ness

Don Stewart

Recent Activity

Donate For Us

Interleaving list functions

Tags:

haskell

Clinton

3 Answers

Petr

Will Ness

Don Stewart

Related questions

Recent Activity

Donate For Us