
Simple parser running out of memory

I'd like to understand why this simple parser runs out of memory on large files. I'm really at a loss as to what I'm doing wrong.

import Data.Attoparsec.ByteString.Char8
import qualified Data.Attoparsec.ByteString.Lazy as Lazy
import System.Environment
import qualified Data.ByteString.Lazy as B
import Control.Applicative

parseLine :: Parser String
parseLine = manyTill' anyChar (endOfLine <|> endOfInput)

parseAll :: Parser [Int]
parseAll = manyTill' 
        (parseLine >> (return 0)) -- discarding what's been read
        endOfInput

main :: IO ()
main = do
        [fn] <- getArgs
        text <- B.readFile fn

        case Lazy.parse parseAll text of
                Lazy.Fail _ _ _ -> putStrLn "bad"
                Lazy.Done _ _ -> putStrLn "ok" 

I'm running the program with:

 runhaskell.exe test.hs x.log

Output:

test.hs: Out of memory

x.log is about 500MB in size. My machine has 16GB of RAM.

asked Dec 29 '16 by user3608068


2 Answers

If you look at the documentation of attoparsec you'll notice that there is a similar example and it is accompanied by the following comment:

Note the overlapping parsers anyChar and string "-->". While this will work, it is not very efficient, as it will cause a lot of backtracking.

Using an alternative to anyChar which rejects the characters accepted by endOfLine should fix the issue. E.g.

satisfy (\c -> c `notElem` ['\n', '\r'])
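
Applied to the question's code, a minimal sketch of that suggestion might look as follows (the `parseLine` below is illustrative, not the only way to phrase it): the character parser now rejects exactly the characters `endOfLine` accepts, so the two alternatives no longer overlap and no per-character backtracking state is needed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Attoparsec.ByteString.Char8
import Control.Applicative ((<|>))

-- Non-overlapping version of the question's parseLine: satisfy
-- rejects '\n' and '\r', the characters endOfLine consumes.
parseLine :: Parser String
parseLine = many' (satisfy (\c -> c `notElem` ['\n', '\r']))
            <* (endOfLine <|> endOfInput)

main :: IO ()
main = print (parseOnly parseLine "hello world\nnext line")
-- prints: Right "hello world"
```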
answered Sep 19 '22 by gallais


I'm not that familiar with Attoparsec, but I think you might have a difficult time using it, alone, to parse a huge file in constant memory. If you replace your top-level parser parseAll with:

parseAll :: Parser ()
parseAll = skipMany anyChar

and profile it, you'll find that memory usage still grows without bound. (And when I converted your code to use incremental reading with strict ByteStrings, it didn't make any difference.)

I believe the problem is this: because Attoparsec does automatic backtracking, it has to be prepared for parseAll (your version or mine -- it doesn't matter) to be used like this:

(parseAll <* somethingThatDoesntMatch) <|> parseDifferently

If parseAll has parsed half a million lines and reaches the end, somethingThatDoesntMatch will cause it to backtrack all the way back to the beginning and then reparse everything with parseDifferently. So, the meta information for backtracking and the ByteStrings themselves can't be freed until the parse is completely finished.

Now, your parser (and my example above), "obviously" won't need to backtrack this way, but Attoparsec doesn't deduce this.
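
To make that concrete, here is a small sketch (names are illustrative) of the kind of composition Attoparsec must stay prepared for. The left branch consumes the whole input and then fails, so the `<|>` has to rewind to position 0 and run the right branch, which is why the input can't be released as it's consumed:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Attoparsec.ByteString.Char8
import Control.Applicative ((<|>))

parseAll :: Parser ()
parseAll = skipMany anyChar

-- The left branch fails at string "!" after parseAll has consumed
-- everything; Attoparsec then backtracks to the start and runs the
-- right branch instead.
combined :: Parser Int
combined = (parseAll *> string "!" *> pure 1) <|> pure 2

main :: IO ()
main = print (parseOnly combined "aaaa")
-- prints: Right 2
```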

I can think of a couple of ways to proceed:

  1. If you are parsing megabytes instead of gigabytes, consider using Parsec, which only backtracks when it's made explicit (e.g., using try).
  2. Break up your log file into lines (or blocks of lines) using a hand-crafted, non-backtracking parser, and run your Attoparsec parser to completion on each line/block.
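
A rough sketch of option 2, under the assumption that records are single newline-separated lines: lazy ByteString's `lines` does the non-backtracking splitting, and Attoparsec then runs to completion on each strict line, so each line can be freed as soon as its parse finishes. `lineParser` here is a hypothetical stand-in for whatever the real per-line grammar is.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Attoparsec.ByteString.Char8 (Parser, parseOnly, takeByteString)
import System.Environment (getArgs)

-- Hypothetical per-line parser: accept any line and discard it,
-- mirroring the question's (parseLine >> return 0).
lineParser :: Parser ()
lineParser = () <$ takeByteString

-- Split lazily into lines, parse each strict line to completion.
checkAll :: L.ByteString -> Bool
checkAll = all (either (const False) (const True)
                . parseOnly lineParser
                . BL.toStrict)
           . L.lines

main :: IO ()
main = do
  [fn] <- getArgs
  text <- L.readFile fn
  putStrLn (if checkAll text then "ok" else "bad")
```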
answered Sep 22 '22 by K. A. Buhr