Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell ByteStrings - ending up with large file loaded into memory

Greetings,

I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.

EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'

Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.

(NB the source file is tab-separated lines)

Thanks in advance,

Kevin

module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad


type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
    liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where 
    r = getField 26
    getField f = BS.copy $ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn $ "Done" ++ (show $ last jrs)
    hFlush stdout
    x <- getLine
    return ()

    -- (***)
    loopInput jrs

main = do 
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    loopInput jrs
like image 287
Kevin Avatar asked Oct 13 '22 22:10

Kevin


2 Answers

Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.

The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:

type Year   = Word16
type Month  = Word8
type Day    = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day 
        deriving (Eq, Ord, Show, Read)

This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.

importRecord now has to call toRecord r to get the right type:

toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
            (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
            _ -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack

We'll need to evalute data when we convert from ByteString to Record, so lets use the parallel package and define an NFData instance from DeepSeq.

instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()

Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:

main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'

And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).

alt text

Sorry about that other misleading wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.

FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.

like image 118
Thomas M. DuBuisson Avatar answered Dec 22 '22 01:12

Thomas M. DuBuisson


There seem to be two questions here:

  • why does the memory usage depend on the presence or absence of the line (***);
  • why is the memory usage with (***) present about 800MB, rather than, say, 40MB.

I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)

As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like

map' f []     = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)

Replace the map in importRecords with map' and see whether that reduces your memory usage.

like image 29
Reid Barton Avatar answered Dec 22 '22 03:12

Reid Barton