Read large lines in huge file without buffering

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and that blows through my memory; I later read that it eventually loads the whole file anyway.

I also tried using pipes-text with folds and view lines:

s <- Pipes.sum $ 
    folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s

to just count the number of lines, and it seems to be doing some wonky stuff: it reports "hGetChunk: invalid argument (invalid byte sequence)" and takes 11 minutes, where wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)

I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.

Thanks!

asked Mar 08 '17 by Charles Durham

2 Answers

The following code uses Conduit, and will:

  • UTF8-decode standard input
  • Run the lineC combinator as long as there is more data available
  • For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
  • Sum up the 1s yielded and print it

You can replace the yield 1 code with something that will process the individual lines; a sketch of such a variant follows the program below.

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC                                    -- stream raw bytes from standard input
    .| decodeUtf8C                               -- decode them to Text incrementally
    .| peekForeverE (lineC (yield (1 :: Int)))   -- yield a 1 per line; the line content is discarded
    .| sumC) >>= print                           -- sum the 1s and print the line count
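
As a sketch of that kind of replacement (this variant is not part of the original answer), the yield 1 can be swapped for lengthCE, so the program prints each line's length while still never holding a whole line in memory:

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

-- Sketch only: print the length of each line instead of counting lines.
-- lengthCE folds over the characters of the Text chunks that make up a
-- line, so the whole line is never accumulated in memory.
main :: IO ()
main = runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC lengthCE >>= yield)
    .| mapM_C (\len -> print (len :: Int))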

answered by Michael Snoyman


This is probably easiest as a fold over the decoded text stream:

{-# LANGUAGE BangPatterns #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT

main :: IO ()
main = do
  -- fold (LT.count '\n') over the UTF-8-decoded chunks of stdin;
  -- void discards the producer of leftover bytes that decodeUtf8 returns
  n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
  print n

It takes about 14% longer than wc -l on the file I produced, which was just long lines of commas and digits. As the documentation says, IO should properly be done with Pipes.ByteString; the rest is conveniences of various sorts.

You can map an attoparsec parser over each line, as delimited by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases, and this might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (e.g. numbers separated by whitespace) you can use Pipes.Attoparsec.parsed to stream them.
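
For instance, if each line were just whitespace-separated decimal numbers (an assumption made for the sake of the sketch, not something stated in the question), they could be parsed and summed one at a time without ever assembling a line:

import Control.Monad (void)
import Data.Attoparsec.Text (Parser, decimal, skipSpace)
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as PA
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT

-- one "figure": optional whitespace (including newlines), then a decimal number
number :: Parser Int
number = skipSpace *> decimal

main :: IO ()
main = do
  -- parsed applies the parser repeatedly, yielding each number as it is parsed;
  -- the outer void discards its Either result (the parse failure at end of
  -- input, plus any leftover input, is simply ignored in this sketch)
  total <- P.sum $ void $ PA.parsed number (void (PT.decodeUtf8 PB.stdin))
  print total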

answered by Michael