Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Incremental but complete parsing with PyParsing?

I'm using PyParsing to parse some rather large text files with a C-like format (braces and semicolons and all that).

PyParsing works just great, but it is slow and consumes a very large amount of memory due to the size of my files.

Because of this, I wanted to try to implement an incremental parsing approach wherein I'd parse the top-level elements of the source file one-by-one. The scanString method of pyparsing seems like the obvious way to do this. However, I want to make sure that there is no invalid/unparseable text in-between the sections parsed by scanString, and can't figure out a good way to do this.

Here's a simplified example that shows the problem I'm having:

sample="""f1(1,2,3); f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""

from pyparsing import *

COMMENT=Suppress('#' + restOfLine())
SEMI,COMMA,LPAREN,RPAREN = map(Suppress,';,()')

ident = Word(alphas, alphanums+"_")
integer = Word(nums+"+-",nums)

statement = ident("fn") + LPAREN + Group(Optional(delimitedList(integer)))("arguments") + RPAREN + SEMI

p = statement.ignore(COMMENT)

for res, start, end in p.scanString(sample):
    print "***** (%d,%d)" % (start, end)
    print res.dump()

Output:

***** (0,10)
['f1', ['1', '2', '3']]
- arguments: ['1', '2', '3']
- fn: f1
***** (11,25)
['f2_no_args', []]
- arguments: []
- fn: f2_no_args
***** (53,62)
['bar', ['7', '8']]
- arguments: ['7', '8']
- fn: bar
***** (88,98)
['baz', ['9', '10']]
- arguments: ['9', '10']
- fn: baz

The ranges returned by scanString have gaps due to unparsed text between them ((0,10),(11,25),(53,62),(88,98)). Two of these gaps are whitespace or comments, which should not trigger an error, but one of them (this should be an error;) contains unparsable text, which I want to catch.

Is there a way to use pyparsing to parse a file incrementally while still ensuring that the entire input could be parsed with the specified parser grammar?

like image 743
Dan Lenski Avatar asked Oct 27 '14 15:10

Dan Lenski


1 Answers

I came up with what seems to be a pretty decent solution after a brief discussion on the PyParsing users' mailing list.

I modified the ParserElement.parseString method slightly to come up with parseConsumeString, which does about what I want. This version calls ParserElement._parse followed by ParserElement.preParse repeatedly.

Here is code to monkey-patch ParserElement with the parseConsumeString method:

from pyparsing import ParseBaseException, ParserElement

def parseConsumeString(self, instring, parseAll=True, yieldLoc=False):
    '''Generator version of parseString which does not try to parse
    the whole string at once.

    Should be called with a top-level parser that could parse the
    entire string if called repeatedly on the remaining pieces.
    Instead of:

        ZeroOrMore(TopLevel)).parseString(s ...)

    Use:

        TopLevel.parseConsumeString(s ...)

    If yieldLoc==True, it will yield a tuple of (tokens, startloc, endloc).
    If False, it will yield only tokens (like parseString).

    If parseAll==True, it will raise an error as soon as a parse
    error is encountered. If False, it will return as soon as a parse
    error is encountered (possibly before yielding any tokens).'''

    if not self.streamlined:
        self.streamline()
        #~ self.saveAsList = True
    for e in self.ignoreExprs:
        e.streamline()
    if not self.keepTabs:
        instring = instring.expandtabs()
    try:
        sloc = loc = 0
        while loc<len(instring):
            # keeping the cache (if in use) across loop iterations wastes memory (can't backtrack outside of loop)
            ParserElement.resetCache()
            loc, tokens = self._parse(instring, loc)
            if yieldLoc:
                yield tokens, sloc, loc
            else:
                yield tokens
            sloc = loc = self.preParse(instring, loc)
    except ParseBaseException as exc:
        if not parseAll:
            return
        elif ParserElement.verbose_stacktrace:
            raise
        else:
            # catch and re-raise exception from here, clears out pyparsing internal stack trace
            raise exc

def monkey_patch():
    ParserElement.parseConsumeString = parseConsumeString

Notice that I also moved the call to ParserElement.resetCache into each loop iteration. Because it's impossible to backtrack out of each loop, there's no need to retain the cache across iterations. This drastically reduce memory consumption when using PyParsing's packrat caching feature. In my tests with a 10 MiB input file, peak memory consumption goes down from ~6G to ~100M peak, while running about 15-20% faster.

like image 114
Dan Lenski Avatar answered Oct 16 '22 17:10

Dan Lenski