
context in pyparsing parse actions besides globals

I'd like to be able to parse two (or any number) of expressions, each with their own set of variable definitions or other context.

There doesn't seem to be an obvious way to associate a context with a particular invocation of pyparsing.ParserElement.parseString(). The most natural way seems to be to use instance methods of some class as the parse actions. The problem with this approach is that the grammar must be redefined for each parse context (for instance, in the class's __init__), which seems terribly inefficient.
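For reference, the instance-method approach described above might look like the following minimal sketch. The class `VariableParser`, its `lookup` action, and the toy "names joined by +" grammar are hypothetical and only illustrate the idea; the code is written to run on Python 2.7 and 3 and uses the camelCase pyparsing names that appear in this thread.

```python
from pyparsing import ParseFatalException, Suppress, Word, ZeroOrMore, alphas


class VariableParser(object):
    """Hypothetical helper: one grammar per context, rebuilt in __init__."""

    def __init__(self, variables):
        self.variables = variables
        # The bound method carries the context, so the grammar objects
        # created here belong to this instance only.
        name = Word(alphas).setParseAction(self.lookup)
        self.expr = name + ZeroOrMore(Suppress('+') + name)

    def lookup(self, instring, loc, tokens):
        key = tokens[0]
        if key not in self.variables:
            # Abort the whole parse as soon as an unknown name is seen.
            raise ParseFatalException(instring, loc, 'unknown name %r' % key)
        return self.variables[key]

    def parse(self, text):
        return self.expr.parseString(text).asList()


first = VariableParser({'x': 1, 'y': 2})
second = VariableParser({'x': 10})
first_result = first.parse('x + y')
second_result = second.parse('x')
```

Each `VariableParser` builds its own `Word`, `Suppress` and `ZeroOrMore` objects, which is exactly the per-context cost the question objects to.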

Using pyparsing.ParserElement.copy() on the rules doesn't help; the individual expressions get cloned all right, but the sub-expressions they are composed of don't get updated in any obvious way, and so none of the parse actions of any nested expression gets invoked.

The only other way I can think of to get this effect would be to define a grammar that returns a context-less abstract parse tree and then process it in a second step. This seems awkward even for simple grammars: it would be nice to just raise an exception the moment an unrecognized name is used. It also still won't parse languages like C, which actually require context about what came before to know which rule matched.

Is there another way of injecting context (without using a global variable, of course) into the parse actions of pyparsing expressions?

SingleNegationElimination asked Jan 01 '12 18:01


1 Answer

A bit late, but searching for "pyparsing reentrancy" leads to this topic, so here is my answer.
I solved the problem of reusing a single parser instance reentrantly by attaching the context to the string being parsed: subclass str, store the context in an attribute of the new class, pass an instance of it to pyparsing, and read the context back inside the parse action.
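One detail worth seeing in isolation is why a str subclass cannot simply rely on inherited string methods: they return a plain str, so any attribute attached to the subclass instance is dropped. The sketch below uses only the standard library; `TaggedStr` and `tag` are hypothetical names chosen for illustration, not part of pyparsing.

```python
class TaggedStr(str):
    """Hypothetical str subclass carrying an arbitrary context attribute."""
    context = None


def tag(text, context):
    # Set the attribute after construction to avoid overriding str.__new__.
    tagged = TaggedStr(text)
    tagged.context = context
    return tagged


original = tag("key\t=value", {"owner": "demo"})
expanded = original.expandtabs()

# Inherited str methods build a plain str, so both the subclass
# and the attached context are gone from the result.
lost_context = type(expanded) is str and not hasattr(expanded, "context")
```

This is the reason the answer's code overrides `expandtabs` to re-wrap the result and copy the context across.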

Python 2.7:

from pyparsing import LineStart, LineEnd, Word, alphas, Optional, Regex, Keyword, OneOrMore

# subclass str; note that unicode is not handled
class SpecStr(str):
    context = None  # will be set in spec_string() below
    # override as pyparsing calls str.expandtabs by default
    def expandtabs(self, tabs=8):
        ret = type(self)(super(SpecStr, self).expandtabs(tabs))
        ret.context = self.context
        return ret    

# set context here rather than in the constructor
# to avoid messing with str.__new__ and super()
def spec_string(s, context):
    ret = SpecStr(s)
    ret.context = context
    return ret    

class Actor(object):
    def __init__(self):
        self.namespace = {}

    def pair_parsed(self, instring, loc, tok):
        self.namespace[tok.key] = tok.value

    def include_parsed(self, instring, loc, tok):
        # doc = open(tok.filename.strip()).read()  # would use this line in real life
        doc = included_doc  # included_doc is defined below
        parse(doc, self)  # <<<<< recursion

def make_parser(actor_type):
    def make_action(fun):  # expects fun to be an unbound method of Actor
        def action(instring, loc, tok):
            if isinstance(instring, SpecStr):
                return fun(instring.context, instring, loc, tok)
            return None  # returning None from a parse action means
            # the tokens have not been changed

        return action

    # Sample grammar: a sequence of lines, 
    # each line is either 'key=value' pair or '#include filename'
    Ident = Word(alphas)
    RestOfLine = Regex('.*')
    Pair = (Ident('key') + '=' +
            RestOfLine('value')).setParseAction(make_action(actor_type.pair_parsed))
    Include = (Keyword('#include') +
               RestOfLine('filename')).setParseAction(make_action(actor_type.include_parsed))
    Line = (LineStart() + Optional(Pair | Include) + LineEnd())
    Document = OneOrMore(Line)
    return Document

Parser = make_parser(Actor)  

def parse(instring, actor=None):
    if actor is not None:
        instring = spec_string(instring, actor)
    return Parser.parseString(instring)


included_doc = 'parrot=dead'
main_doc = """\
#include included_doc
ham = None
spam = ham"""

# parsing without context is ok
print 'parsed data:', parse(main_doc)

actor = Actor()
parse(main_doc, actor)
print 'resulting namespace:', actor.namespace

yields

parsed data: ['#include', 'included_doc', '\n', 'ham', '=', 'None', '\n', 'spam', '=', 'ham']
resulting namespace: {'ham': 'None', 'parrot': 'dead', 'spam': 'ham'}

This approach makes the Parser itself fully reusable and reentrant. The pyparsing internals are generally reentrant too, as long as you don't touch ParserElement's static fields. The only drawback is that pyparsing resets its packrat cache on each call to parseString, but this can be worked around by overriding SpecStr.__hash__ (so that it hashes by identity like object, not by value like str) plus some monkeypatching. On my dataset this is not an issue at all: the performance hit is negligible, and resetting the cache even keeps memory usage down.
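The hashing half of that workaround can be sketched with the standard library alone. `ContextStr` is a hypothetical stand-in for SpecStr, and the dict is only a stand-in for a cache; how pyparsing actually keys its packrat cache depends on the version, and the monkeypatching needed to stop it from being reset is not shown here.

```python
class ContextStr(str):
    """Hypothetical stand-in for SpecStr with identity-based hashing."""
    context = None
    # Hash by object identity instead of by string value.
    __hash__ = object.__hash__


a = ContextStr("spam = ham")
b = ContextStr("spam = ham")
a.context = "first parse"
b.context = "second parse"

# The two strings still compare equal as text...
same_text = (a == b)
# ...but a hash-based cache keyed on them keeps the entries apart.
cache = {a: "result for the first parse"}
b_is_cached = b in cache
```

With value-based hashing, two parses of identical text under different contexts would share cache entries, which is presumably why identity hashing is suggested.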

robyschek answered Sep 17 '22 22:09