
How to parse a 7GB file, with Data.ByteString?

I have to parse a file, and of course I have to read it first. Here is my program:

import qualified Data.ByteString.Char8 as B
import System.Environment    

main = do
 args      <- getArgs
 let path  =  args !! 0
 content   <- B.readFile path
 let lines = B.lines content
 foobar lines 

-- Top-level definitions must start in column 0, outside main's do block.
foobar :: [B.ByteString] -> IO ()
foobar _ = return ()

But after compiling with

> ghc --make -O2 tmp.hs 

running it on a 7 GB file fails with the following error:

> ./tmp  big_big_file.dat
> tmp: {handle: big_big_file.dat}: hGet: illegal ByteString size (-1501792951): illegal operation

Thanks for any reply!

Fopa Léon Constantin asked Apr 04 '12 13:04


2 Answers

The length of a ByteString is an Int. If Int is 32 bits, the size of a 7GB file exceeds the range of Int, so the buffer request is made with the wrong size, which can easily be negative.

The code for readFile converts the file size to Int for the buffer request:

readFile :: FilePath -> IO ByteString
readFile f = bracket (openBinaryFile f ReadMode) hClose
    (\h -> hFileSize h >>= hGet h . fromIntegral)

and if that overflows, an "illegal ByteString size" error or a segmentation fault are the most likely outcomes.
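The wraparound can be reproduced without a 32-bit machine by narrowing to `Int32` explicitly. This is only a sketch: the helper name `narrowToInt32` is hypothetical, and the file size of 7088141641 bytes is an assumption, chosen because it is the value near 7 GB that is consistent with the number in the error message.

```haskell
import Data.Int (Int32)

-- Hypothetical helper: narrow an Integer to a 32-bit signed integer.
-- This mimics what fromIntegral does to the file size when Int is 32 bits.
narrowToInt32 :: Integer -> Int32
narrowToInt32 = fromIntegral

main :: IO ()
main = do
    -- Assumed file size in bytes (about 7.09 GB), not taken from the question.
    let fileSize = 7088141641 :: Integer
    -- Conversion wraps modulo 2^32, giving the negative size from the error.
    print (narrowToInt32 fileSize)   -- prints -1501792951
```

Under that assumption, the negative size in the error is simply the real size reduced modulo 2^32 and read as a signed number.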

If at all possible, use lazy ByteStrings to handle files that big. In your case, you pretty much have to make it possible, since with 32 bit Ints, a 7GB ByteString is impossible to create.

If you need the lines to be strict ByteStrings for the processing, and no line is exceedingly long, you can go through lazy ByteStrings to achieve that:

import qualified Data.ByteString.Lazy.Char8 as LC
import qualified Data.ByteString.Char8 as C

main = do
    ...
    content <- LC.readFile path
    let llns = LC.lines content
        slns = map (C.concat . LC.toChunks) llns
    foobar slns

but if you can modify your processing to deal with lazy ByteStrings, that will probably be better overall.
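As a rough illustration of processing that works directly on lazy ByteStrings, the sketch below counts the lines of a file. The function `countLines` is a hypothetical stand-in for the real processing, which the question does not show.

```haskell
import qualified Data.ByteString.Lazy.Char8 as LC
import System.Environment (getArgs)

-- Hypothetical stand-in for the real processing: count the lines.
countLines :: LC.ByteString -> Int
countLines = length . LC.lines

main :: IO ()
main = do
    args <- getArgs
    case args of
        [path] -> do
            -- The file is read lazily, chunk by chunk, as it is consumed.
            content <- LC.readFile path
            print (countLines content)
        _ -> putStrLn "usage: tmp FILE"
```

Because `content` is consumed once and not kept around afterwards, chunks that have already been processed can be garbage collected, so the whole file should not need to be in memory at the same time.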

Daniel Fischer answered Oct 05 '22 22:10


Strict ByteStrings are limited to 2 GiB when Int is 32 bits, because their length is stored as an Int. You need to use lazy ByteStrings for a file this large.
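One way to see whether a given file can be read strictly on the current platform is to compare its size with `maxBound :: Int` before reading it. This is a minimal sketch and the helper name `fitsInInt` is hypothetical. It only checks the size limit, not whether enough memory is available.

```haskell
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), hFileSize, withFile)

-- Hypothetical helper: does a size in bytes fit in this platform's Int?
fitsInInt :: Integer -> Bool
fitsInInt n = n <= toInteger (maxBound :: Int)

main :: IO ()
main = do
    args <- getArgs
    case args of
        [path] -> do
            -- hFileSize returns an Integer, so the size itself cannot overflow.
            size <- withFile path ReadMode hFileSize
            putStrLn $ if fitsInInt size
                then "size fits in Int: a strict read is at least representable"
                else "size exceeds Int: use lazy ByteStrings"
        _ -> putStrLn "usage: check FILE"
```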

dflemstr answered Oct 05 '22 23:10