My context is bioinformatics, next-generation sequencing in particular, but the problem is generic; so I will use a log file as an example.
The file is very large (Gigabytes large, compressed, so it will not fit in memory), but is easy to parse (each line is an entry), so we can easily write something like:
parse :: Lazy.ByteString -> [LogEntry]
Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as:
totalEntries = length
nrBots = sum . map fromEnum . map isBotEntry
averageTimeOfDay = histogram . map extractHour
All of these are of the form foldl' k z . map f
.
The problem is that if I try to use them in the most natural way, like
main = do
input <- Lazy.readFile "input.txt"
let logEntries = parse input
totalEntries' = totalEntries logEntries
nrBots' = nrBots logEntries
avgTOD = averageTimeOfDay logEntries
print totalEntries'
print nrBots'
print avgTOD
This will allocate the whole list in memory, which is not what I want. I want the folds to be done synchronously, so that the cons cells can be garbage collected. If I compute only a single statistic, this is what happens.
I can write a single big function that does this, but it is non-composable code.
Alternatively, which is what I have been doing, I run each pass separately, but this reloads & uncompresses the file each time.
This a comment on the comment of sdcvvc referring to this 'beautiful folding' essay It was so cool -- beautiful, as he says -- I couldn't resist adding Functor
and Applicative
instances and a few other bits of modernization. Simultaneous folding of, say, x
y
and z
is a straightforward product: (,,) <$> x <*> y <*> z
. I made a half-gigabyte file of small random ints and it took 10 seconds to give the -- admittedly trivial -- calculation of length, sum and maximum on my rusty laptop. It doesn't seem to be helped by further annotations, but the compiler could see Int
was all I was interested in; the obvious map read . lines
as a parser led to a hopeless space and time catastrophe so I unfolded with a crude use of ByteString.readInt
; otherwise it is basically a Data.List
process.
{-# LANGUAGE GADTs, BangPatterns #-}
import Data.List (foldl', unfoldr)
import Control.Applicative
import qualified Data.ByteString.Lazy.Char8 as B
main = fmap readInts (B.readFile "int.txt") >>= print . fold allThree
where allThree = (,,) <$> length_ <*> sum_ <*> maximum_
data Fold b c where F :: (a -> b -> a) -> a -> (a -> c) -> Fold b c
data Pair a b = P !a !b
instance Functor (Fold b) where fmap f (F op x g) = F op x (f . g)
instance Applicative (Fold b) where
pure c = F const () (const c)
(F f x c) <*> (F g y c') = F (comb f g) (P x y) (c *** c')
where comb f g (P a a') b = P (f a b) (g a' b)
(***) f g (P x y) = f x ( g y)
fold :: Fold b c -> [b] -> c
fold (F f x c) bs = c $ (foldl' f x bs)
sum_, product_ :: Num a => Fold a a
length_ :: Fold a Int
sum_ = F (+) 0 id
product_ = F (*) 1 id
length_ = F (const . (+1)) 0 id
maximum_ = F max 0 id
readInts = unfoldr $ \bs -> case B.readInt bs of
Nothing -> Nothing
Just (n,bs2) -> if not (B.null bs2) then Just (n,B.tail bs2)
else Just (n,B.empty)
Edit: unsurprisingly, since we have to do with an unboxed type above, and an unboxed vector derived from e.g. a 2G file can fit in memory, this is all twice as fast and somewhat better behaved if it is given the obvious relettering for Data.Vector.Uboxed http://hpaste.org/69270 Of course this isn't relevant where one has types like LogEntry
Note though that the Fold
type and Fold 'multiplication' generalizes over sequential types without revision, thus e.g. the Folds associated with operations on Char
s or Word8
s can be simultaneously folded directly over a ByteString. One must first define a foldB
, by relettering fold
to use the foldl'
s in the various ByteString modules. But the Fold
s and products of Fold
s are the same ones you would fold a list or vector of Char
s or Word8
s
To process lazy data muiltiple times, in constant space, you can do three things:
par
to do n parallel traversals at the same timeThose are your options. The last one is the coolest :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With