I'd like to understand why this simple parser runs out of memory for large files. I'm really clueless what am I doing wrong.
import Data.Attoparsec.ByteString.Char8
import qualified Data.Attoparsec.ByteString.Lazy as Lazy
import System.Environment
import qualified Data.ByteString.Lazy as B
import Control.Applicative
parseLine :: Parser String
parseLine = manyTill' anyChar (endOfLine <|> endOfInput)
parseAll :: Parser [Int]
parseAll = manyTill' 
        (parseLine >> (return 0)) -- discarding what's been read
        endOfInput
main :: IO()
main = do 
        [fn] <- getArgs
        text <- B.readFile fn
        case Lazy.parse parseAll text of
                Lazy.Fail _ _ _ -> putStrLn "bad"
                Lazy.Done _ _ -> putStrLn "ok" 
I'm running the program with:
 runhaskell.exe test.hs x.log
Output:
test.hs: Out of memory
x.log is about 500MB in size. My machine has 16GB of RAM.
If you look at the documentation of attoparsec you'll notice that there is a similar example and it is accompanied by the following comment:
Note the overlapping parsers
anyCharandstring "-->". While this will work, it is not very efficient, as it will cause a lot of backtracking.
Using an alternative to anyChar which rejects the characters accepted by endOfLine should fix the issue. E.g.
satisfy (\c -> c `notElem` ['\n', '\r'])
                        I'm not that familiar with Attoparsec, but I think you might have a difficult time using it, alone, to parse a huge file in constant memory.  If you replace your top-level parser parseAll with:
parseAll :: Parser ()
parseAll = skipMany anyChar
and profile it, you'll find that memory usage still grows without bound.  (And when I converted your code to use incremental reading with strict ByteStrings, it didn't make any difference.)
I believe the problem is this:  because Attoparsec does automatic backtracking, it has to be prepared for parseAll (your version or mine -- it doesn't matter) to be used like this:
(parseAll <* somethingThatDoesntMatch) <|> parseDifferently
If parseAll has parsed half a million lines and reaches the end, somethingThatDoesntMatch will cause it to backtrack all the way back to the beginning and then reparse everything with parseDifferently.  So, the meta information for backtracking and the ByteStrings themselves can't be freed until the parse is completely finished.
Now, your parser (and my example above), "obviously" won't need to backtrack this way, but Attoparsec doesn't deduce this.
I can think of a couple of ways to proceed:
try).If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With