Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

IO over big files in haskell: Performance issue

I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.

import System 
import qualified Data.ByteString.Lazy as BL 
import Data.Word  
import Data.List

main :: IO () 
main =     
    do         
        args <- System.getArgs         
        let filename = head args         
        byteString <- BL.readFile filename         
        let wordsList = BL.unpack byteString         
        let foldFun acc word = doSomeStuff word : acc
        let wordsListCopy = foldl' foldFun [] wordsList
        let byteStringCopy = BL.pack (reverse wordsListCopy)
        BL.writeFile (filename ++ ".cpy") byteStringCopy
    where
        doSomeStuff = id

I name this file TestCopy.hs, then do the following:

$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root  5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs 
[1 of 1] Compiling Main             ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB

real    0m5.631s
user    0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB 

real    3m6.671s
user    0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy 
$ time ./TestCopy 10MB +RTS -K500M -RTS

real    2m50.261s
user    0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy 
$ 

My problem: There is a huge difference between a 5MB and a 10 MB file. I'd like the performances to be linear in the size of the input file. Please what am i doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard ghc library.

Precision: It's for a university project. And I'm not trying to copy files. The doSomeStuff function shall perform compression/decompression actions that I have to customize.

like image 633
Joel Avatar asked Mar 24 '11 13:03

Joel


2 Answers

For chunked input processing I would use the enumerator package.

import Data.Enumerator
import Data.Enumerator.Binary (enumFile)

We use bytestrings

import Data.ByteString as BS

and IO

import Control.Monad.Trans (liftIO)
import Control.Monad (mapM_)
import System (getArgs)

Your main function could look like following:

main =
  do (filepath:_) <- getArgs
     let destination
     run_ $ enumFile filepath $$ writeFile (filepath ++ ".cpy")

enumFile reads 4096 bytes per chunk and passes these to writeFile, which writes it down.

enumWrite is defined as:

enumWrite :: FilePath -> Iteratee BS.ByteString IO ()
enumWrite filepath =
  do liftIO (BS.writeFile filepath BS.empty)   -- ensure the destination is empty
     continue step
  where
  step (Chunks xs) =
    do liftIO (mapM_ (BS.appendFile filepath) xs)
       continue step
  step EOF         = yield () EOF

As you can see, the step function takes chunks of bytestrings and appends them to the destination file. These chunks have the type Stream BS.Bytestring, where Stream is defined as:

data Stream a = Chunks [a] | EOF

On an EOF step terminates, yielding ().

To have a much more elaborate read on this I personally recommend Michael Snoymans tutorial

The numbers

$ time ./TestCopy 5MB
./TestCopy 5MB  2,91s user 0,32s system 96% cpu 3,356 total

$ time ./TestCopy2 5MB
./TestCopy2 5MB  0,04s user 0,03s system 93% cpu 0,075 total

That's quite an improvement. Now in order to implement your fold you probably want to write an Enumeratee, which is used to transform a input stream. Fortunately there is already a map function defined in the enumerator package, which can be modified for your need, i.e. it can be modified to carry over state.

On the construction of the intermediate result

You construct wordsList in reverse order and reverse it afterwards. I think difference lists do a better job, because appends take only O(1) time due to the fact that appending is only a function composition. I'm not sure whether they takes more space though. Here's a rough sketch of difference lists:

type DList a = [a] -> [a]

emptyList :: DList a
emptyList = id

snoc :: DList a -> a -> DList a
snoc dlist a = dlist . (a:)

toList :: DList a -> [a]
toList dlist = dlist []

This answer is probably not needed anymore, but I added it for completeness.

like image 119
Long Avatar answered Oct 26 '22 02:10

Long


I take it this is a follow on to Reading large file in haskell? from yesterday.

Try compiling with "-rtsopts -O2" instead of just "-O".

You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.

With my system I see "ghc -rtsopts --make -O2 b.hs" giving

(! 741)-> time ./b five
real 0m2.789s user 0m2.235s sys 0m0.518s
(! 742)-> time ./b ten
real 0m5.830s user 0m4.623s sys 0m1.027s

Which now looks linear to me.

like image 35
Chris Kuklewicz Avatar answered Oct 26 '22 02:10

Chris Kuklewicz