Lets say I'm given two functions:
f :: [a] -> b
g :: [a] -> c
I want to write a function that is the equivalent of this:
h x = (f x, g x)
But when I do that, for large lists inevitably I run out of memory.
A simple example is the following:
x = [1..100000000::Int]
main = print $ (sum x, product x)
I understand this is the case because the list x
is being stored in memory without being garbage collected. It would be better instead of f
and g
worked on x
in, well, "parallel".
Assuming I can't change f
and g
, nor want to make a separate copy of x
(assume x
is expensive to produce) how can I write h
without running into out of memory issues?
A short answer is you can't. Since you have no control over f
and g
, you have no guarantee that the functions process their input sequentially. Such a function can as well keep the whole list stored in memory before producing the final result.
However, if your functions are expressed as folds, the situation is different. This means that we know how to incrementally apply each step, so we can parallelize those steps in one run.
The are many resources about this area. For example:
The pattern of consuming a sequence of values with properly defined space bounds is solved more generally with pipe-like libraries such conduit, iteratees or pipes. For example, in conduit, you could express the combination of computing sums and products as
import Control.Monad.Identity
import Data.Conduit
import Data.Conduit.List (fold, sourceList)
import Data.Conduit.Internal (zipSinks)
product', sum' :: (Monad m, Num a) => Sink a m a
sum' = fold (+) 0
product' = fold (*) 1
main = print . runIdentity $ sourceList (replicate (10^6) 1) $$
zipSinks sum' product'
If you can turn your functions into folds, you can then just use them with a scan:
x = [1..100000000::Int]
main = mapM_ print . tail . scanl foo (a0,b0) . takeWhile (not.null)
. unfoldr (Just . splitAt 1000) -- adjust the chunk length as needed
$ x
foo (a,b) x = let a2 = f' a $ f x ; b2 = g' b $ g x
in a2 `seq` b2 `seq` (a2, b2)
f :: [t] -> a -- e.g. sum
g :: [t] -> b -- (`rem` 10007) . product
f' :: a -> a -> a -- e.g. (+)
g' :: b -> b -> b -- ((`rem` 10007) .) . (*)
we consume the input in chunks for better performance. Compiled with -O2
, this should run in a constant space. The interim results are printed as indication of progress.
If you can't turn your function into a fold, this means it has to consume the whole list to produce any output and this trick doesn't apply.
You can use multiple threads to evaluate f x
and g x
in parallel.
E.g.
x :: [Int]
x = [1..10^8]
main = print $ let a = sum x
b = product x
in a `par` b `pseq` (a,b)
Its a nice way to exploit GHC's parallel runtime to prevent a space leak by doing two things at once.
Alternatively, you need to fuse f
and g
into a single pass.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With