Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interleaving list functions

Tags:

haskell

Lets say I'm given two functions:

f :: [a] -> b
g :: [a] -> c

I want to write a function that is the equivalent of this:

h x = (f x, g x)

But when I do that, for large lists inevitably I run out of memory.

A simple example is the following:

x = [1..100000000::Int] 
main = print $ (sum x, product x)

I understand this is the case because the list x is being stored in memory without being garbage collected. It would be better instead of f and g worked on x in, well, "parallel".

Assuming I can't change f and g, nor want to make a separate copy of x (assume x is expensive to produce) how can I write h without running into out of memory issues?

like image 751
Clinton Avatar asked Jun 05 '13 07:06

Clinton


3 Answers

A short answer is you can't. Since you have no control over f and g, you have no guarantee that the functions process their input sequentially. Such a function can as well keep the whole list stored in memory before producing the final result.

However, if your functions are expressed as folds, the situation is different. This means that we know how to incrementally apply each step, so we can parallelize those steps in one run.

The are many resources about this area. For example:

  • Haskell: Can I perform several folds over the same lazy list without keeping list in memory?
  • Classic Beautiful folding
  • More beautiful fold zipping

The pattern of consuming a sequence of values with properly defined space bounds is solved more generally with pipe-like libraries such conduit, iteratees or pipes. For example, in conduit, you could express the combination of computing sums and products as

import Control.Monad.Identity
import Data.Conduit
import Data.Conduit.List (fold, sourceList)
import Data.Conduit.Internal (zipSinks)

product', sum' :: (Monad m, Num a) => Sink a m a
sum'     = fold (+) 0
product' = fold (*) 1

main = print . runIdentity $ sourceList (replicate (10^6) 1) $$
                                zipSinks sum' product'
like image 155
Petr Avatar answered Oct 21 '22 17:10

Petr


If you can turn your functions into folds, you can then just use them with a scan:

x = [1..100000000::Int] 
main = mapM_ print . tail . scanl foo (a0,b0) . takeWhile (not.null)  
         . unfoldr (Just . splitAt 1000)  -- adjust the chunk length as needed
         $ x

foo (a,b) x = let a2 = f' a $ f x ; b2 = g' b $ g x
              in a2 `seq` b2 `seq` (a2, b2)

f :: [t] -> a         -- e.g. sum
g :: [t] -> b         --      (`rem` 10007) . product
f' :: a -> a -> a     -- e.g. (+)
g' :: b -> b -> b     --      ((`rem` 10007) .) . (*)

we consume the input in chunks for better performance. Compiled with -O2, this should run in a constant space. The interim results are printed as indication of progress.

If you can't turn your function into a fold, this means it has to consume the whole list to produce any output and this trick doesn't apply.

like image 36
Will Ness Avatar answered Oct 21 '22 15:10

Will Ness


You can use multiple threads to evaluate f x and g x in parallel.

E.g.

x :: [Int]
x = [1..10^8]

main = print $ let a = sum x
                   b = product x
               in a `par` b `pseq` (a,b) 

Its a nice way to exploit GHC's parallel runtime to prevent a space leak by doing two things at once.

Alternatively, you need to fuse f and g into a single pass.

like image 2
Don Stewart Avatar answered Oct 21 '22 16:10

Don Stewart