Read large lines in huge file without buffering

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and that blows through my memory; I later read that it eventually loads the whole file anyway.

I also tried using pipes-text with folds and view lines:

s <- Pipes.sum $ 
    folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s

to just count the number of lines, and it seems to be doing some wonky stuff: it reports "hGetChunk: invalid argument (invalid byte sequence)" and takes 11 minutes, where wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)

I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.

Thanks!

asked Mar 08 '17 by Charles Durham

2 Answers

The following code uses Conduit, and will:

  • UTF8-decode standard input
  • Run the lineC combinator as long as there is more data available
  • For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
  • Sum up the 1s yielded and print it

You can replace the yield 1 code with something that will process the individual lines; a sketch of such a variant follows the program below.

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC                                    -- stream raw bytes from standard input
    .| decodeUtf8C                               -- decode them to Text incrementally
    .| peekForeverE (lineC (yield (1 :: Int)))   -- yield a 1 per line; the line content is discarded
    .| sumC) >>= print                           -- sum the 1s and print the line count
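
As a sketch of that kind of replacement (this variant is not part of the original answer), the yield 1 can be swapped for lengthCE, so the program prints each line's length while still never holding a whole line in memory:

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

-- Sketch only: print the length of each line instead of counting lines.
-- lengthCE folds over the characters of the Text chunks that make up a
-- line, so the whole line is never accumulated in memory.
main :: IO ()
main = runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC lengthCE >>= yield)
    .| mapM_C (\len -> print (len :: Int))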

answered by Michael Snoyman


This is probably easiest as a fold over the decoded text stream:

{-# LANGUAGE BangPatterns #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT

main :: IO ()
main = do
  -- fold (LT.count '\n') over the UTF-8-decoded chunks of stdin;
  -- void discards the producer of leftover bytes that decodeUtf8 returns
  n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
  print n

It takes about 14% longer than wc -l on the file I produced, which was just long lines of commas and digits. As the documentation says, IO should properly be done with Pipes.ByteString; the rest is conveniences of various sorts.

You can map an attoparsec parser over each line, as delimited by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases, and this might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (e.g. numbers separated by whitespace) you can use Pipes.Attoparsec.parsed to stream them.
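
For instance, if each line were just whitespace-separated decimal numbers (an assumption made for the sake of the sketch, not something stated in the question), they could be parsed and summed one at a time without ever assembling a line:

import Control.Monad (void)
import Data.Attoparsec.Text (Parser, decimal, skipSpace)
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as PA
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT

-- one "figure": optional whitespace (including newlines), then a decimal number
number :: Parser Int
number = skipSpace *> decimal

main :: IO ()
main = do
  -- parsed applies the parser repeatedly, yielding each number as it is parsed;
  -- the outer void discards its Either result (the parse failure at end of
  -- input, plus any leftover input, is simply ignored in this sketch)
  total <- P.sum $ void $ PA.parsed number (void (PT.decodeUtf8 PB.stdin))
  print total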

answered by Michael