I'd like to understand why this simple parser runs out of memory for large files. I'm really clueless about what I'm doing wrong.
import Data.Attoparsec.ByteString.Char8
import qualified Data.Attoparsec.ByteString.Lazy as Lazy
import System.Environment
import qualified Data.ByteString.Lazy as B
import Control.Applicative

parseLine :: Parser String
parseLine = manyTill' anyChar (endOfLine <|> endOfInput)

parseAll :: Parser [Int]
parseAll = manyTill'
    (parseLine >> (return 0)) -- discarding what's been read
    endOfInput

main :: IO()
main = do
    [fn] <- getArgs
    text <- B.readFile fn
    case Lazy.parse parseAll text of
        Lazy.Fail _ _ _ -> putStrLn "bad"
        Lazy.Done _ _ -> putStrLn "ok"
I'm running the program with:
runhaskell.exe test.hs x.log
Output:
test.hs: Out of memory
x.log is about 500MB in size. My machine has 16GB of RAM.
If you look at the documentation of attoparsec you'll notice that there is a similar example and it is accompanied by the following comment:
Note the overlapping parsers anyChar and string "-->". While this will work, it is not very efficient, as it will cause a lot of backtracking.
Using an alternative to anyChar which rejects the characters accepted by endOfLine should fix the issue. E.g.
satisfy (\c -> c `notElem` ['\n', '\r'])
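A minimal sketch of how that suggestion might slot into the question's parseLine. The first definition only swaps anyChar for the satisfy predicate above. The second, parseLineBS, is a hypothetical name for a variant of my own that uses attoparsec's takeTill to return a ByteString slice instead of building a String; it is not part of the original suggestion.

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as BS

-- True for characters that cannot begin a line terminator.
notEol :: Char -> Bool
notEol c = c `notElem` ['\n', '\r']

-- The question's parseLine with anyChar replaced, so the element
-- parser and the terminator parser no longer overlap.
parseLine :: Parser String
parseLine = manyTill' (satisfy notEol) (endOfLine <|> endOfInput)

-- Assumed alternative (not from the answer): grab the whole line
-- in one step as a strict ByteString, then consume the terminator.
parseLineBS :: Parser BS.ByteString
parseLineBS = takeTill (not . notEol) <* (endOfLine <|> endOfInput)
```

One behavioural difference to be aware of: endOfLine only accepts "\n" or "\r\n", so unlike the anyChar version, both sketches fail on a lone '\r' in the middle of a line.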
I'm not that familiar with Attoparsec, but I think you might have a difficult time using it, alone, to parse a huge file in constant memory. If you replace your top-level parser parseAll with:
parseAll :: Parser ()
parseAll = skipMany anyChar
and profile it, you'll find that memory usage still grows without bound. (And when I converted your code to use incremental reading with strict ByteStrings, it didn't make any difference.)
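For reference, an incremental conversion of that kind might look roughly like the sketch below. This is not the exact code I tried; the helper name parseFileInChunks and the 64 KiB chunk size are assumptions made for illustration. It uses attoparsec's parseWith to pull strict chunks from a handle on demand.

```haskell
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as BS
import System.IO (IOMode (ReadMode), withFile)

-- Hypothetical helper: run a parser over a file, feeding it
-- strict ByteString chunks whenever it asks for more input.
parseFileInChunks :: FilePath -> Parser a -> IO (Result a)
parseFileInChunks path p =
    withFile path ReadMode $ \h ->
        -- hGetSome returns an empty ByteString at end of file,
        -- which parseWith hands to the parser as end of input.
        parseWith (BS.hGetSome h 65536) p BS.empty
```

Calling parseFileInChunks fn parseAll reads the file piece by piece, but as noted above this alone did not stop memory from growing.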
I believe the problem is this: because Attoparsec does automatic backtracking, it has to be prepared for parseAll (your version or mine, it doesn't matter) to be used like this:
(parseAll <* somethingThatDoesntMatch) <|> parseDifferently
If parseAll has parsed half a million lines and reaches the end, somethingThatDoesntMatch will cause it to backtrack all the way back to the beginning and then reparse everything with parseDifferently. So, the meta information for backtracking and the ByteStrings themselves can't be freed until the parse is completely finished.
Now, your parser (and my example above), "obviously" won't need to backtrack this way, but Attoparsec doesn't deduce this.
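To make the backtracking scenario concrete, here is a small sketch on a tiny input. The stand-ins are assumptions chosen for illustration: char '!' plays the role of somethingThatDoesntMatch (it can never match, because parseAll has already consumed everything), and parseDifferently simply returns whatever input it is given.

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as BS

-- Consumes the entire input, as in the example above.
parseAll :: Parser ()
parseAll = skipMany anyChar

-- Hypothetical fallback: return all the input it can see.
parseDifferently :: Parser BS.ByteString
parseDifferently = takeByteString

-- The left branch reads to the end and then fails on char '!',
-- so the right branch is retried from the very first byte.
combined :: Parser BS.ByteString
combined = (BS.empty <$ (parseAll <* char '!')) <|> parseDifferently
```

Running parseOnly combined on some input hands the complete input back via parseDifferently, which is only possible if everything parseAll consumed was still being retained when the left branch failed.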
I can think of a couple of ways to proceed:
try
).