
"Sub-parsers" in pipes-attoparsec

I'm trying to parse binary data using pipes-attoparsec in Haskell. The reason pipes (proxies) are involved is to interleave reading with parsing, to avoid high memory use for large files. Many binary formats are based on blocks (or chunks), whose sizes are often given by a field in the file. I'm not sure what a parser for such a block is called, but that's what I mean by "sub-parser" in the title. The problem I have is implementing them in a concise way without a potentially large memory footprint. I've come up with two alternatives that each fail in some regard.

Alternative 1 is to read the block into a separate bytestring and start a separate parser for it. While concise, a large block will cause high memory use.

Alternative 2 is to keep parsing in the same context and track the number of bytes consumed. This tracking is error-prone and seems to infest all the parsers that compose into the final blockParser. For a malformed input file it could also waste time by parsing further than indicated by the size field before the tracked size can be compared.

import Control.Applicative ((<$>), (*>))
import Control.Monad (when, unless)
import Control.Monad.Trans.Class (lift)
import Control.Proxy
import Control.Proxy.Attoparsec
import Control.Proxy.Trans.Either
import Data.Attoparsec as P
import Data.Attoparsec.Binary
import qualified Data.ByteString as BS
import System.IO

parser = do
    size <- fromIntegral <$> anyWord32le

    -- alternative 1 (ignore the Either for simplicity):
    -- read the whole block into memory, then run a separate parser on it
    Right result <- parseOnly blockParser <$> P.take size
    return result

    -- alternative 2 (in practice only one of the alternatives would be present):
    -- blockParser also reports how many bytes it consumed
    (result, trackedSize) <- blockParser
    when (size /= trackedSize) $ fail "size mismatch"
    return result

blockParser = undefined

main = withBinaryFile "bin" ReadMode go where
    go h = fmap print . runProxy . runEitherK $ session h
    session h = printD <-< parserD parser <-< throwParsingErrors <-< parserInputD <-< readChunk h 128
    readChunk h n () = runIdentityP go where
        go = do
            c <- lift $ BS.hGet h n
            unless (BS.null c) $ respond c *> go
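
To make the bookkeeping of alternative 2 concrete, here is a hypothetical, standalone sketch (the Block type and the fieldA/fieldB parsers are made up purely for illustration) of what the tracking looks like once it spreads into the sub-parsers:

import Control.Applicative ((<$>))
import Data.Attoparsec.ByteString (Parser)
import Data.Attoparsec.Binary (anyWord16le, anyWord32le)
import Data.Word (Word16, Word32)

-- Made-up block layout, for illustration only.
data Block = Block Word32 Word16 deriving Show

-- Every field parser has to report how many bytes it consumed ...
fieldA :: Parser (Word32, Int)
fieldA = (\w -> (w, 4)) <$> anyWord32le

fieldB :: Parser (Word16, Int)
fieldB = (\w -> (w, 2)) <$> anyWord16le

-- ... and the block parser has to thread the counts through and add them up.
blockParser :: Parser (Block, Int)
blockParser = do
    (a, na) <- fieldA
    (b, nb) <- fieldB
    return (Block a b, na + nb)

Even these two tiny field parsers can no longer simply return their values; the byte counts have to be threaded through every level by hand, which is the infestation I mean.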
absence asked Oct 04 '22 17:10

1 Answer

I like to call this a "fixed-input" parser.

I can tell you how pipes-parse will do it. You can see a preview of what I'm about to describe in the parseN and parseWhile functions of the library. Those are actually for generic inputs, but I wrote similar ones for example String parsers as well, here and here.

The trick is really simple: you insert a fake end-of-input marker where you want the parser to stop, run the parser (which will fail if it hits the fake end-of-input marker), and then remove the end-of-input marker.

Obviously, that's not as easy as I make it sound, but it's the general principle. The tricky parts are:

  • Doing it in such a way that it still streams. The example I linked doesn't do that yet, but the way to do it in a streaming fashion is to insert a pipe upstream that counts the bytes flowing through it and inserts the end-of-input marker at the correct spot.

  • Not interfering with existing end of input markers

This trick can be adapted for pipes-attoparsec, but I think the best solution would be for attoparsec to directly include this feature. However, if that solution is not available, then we can restrict the input that is fed to the attoparsec parser.
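
To make that last point concrete outside of pipes, here is a minimal sketch using attoparsec's incremental interface directly. parseLimited is a hypothetical helper (it is not part of attoparsec or pipes-attoparsec) that feeds a sub-parser at most `limit` bytes from a handle, chunk by chunk, and then hands it an empty chunk, which attoparsec interprets as end of input (the fake end-of-input marker described above). A streaming pipes version would count bytes the same way, just inside a proxy instead of an IO loop.

import qualified Data.Attoparsec.ByteString as P
import qualified Data.ByteString as BS
import System.IO (Handle)

-- Hypothetical helper, not part of any library: run a sub-parser over at most
-- `limit` bytes read from `h`, streaming in small chunks so the block is never
-- held in memory all at once.
parseLimited :: Handle -> Int -> P.Parser a -> IO (Either String a)
parseLimited h limit p = do
    first <- BS.hGet h (min limit 4096)
    go (limit - BS.length first) (P.parse p first)
  where
    go _ (P.Done _ r)     = return (Right r)
    go _ (P.Fail _ _ err) = return (Left err)
    go remaining (P.Partial k)
      -- Byte budget exhausted: feed an empty chunk, which attoparsec treats
      -- as end of input (the fake end-of-input marker).
      | remaining <= 0 = go 0 (k BS.empty)
      | otherwise = do
          -- Never read past the end of the block.
          next <- BS.hGet h (min remaining 4096)
          go (remaining - BS.length next) (k next)

With something like this, the sub-parser never sees more than `limit` bytes, so a malformed block cannot make it run past what the size field promised, and the block is consumed incrementally instead of being buffered whole.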

Gabriella Gonzalez answered Oct 09 '22 11:10