Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

parsing a complex logical expression in pyparsing in a binary tree fashion

I am trying to parse complex logical expression like the one below;

x > 7 AND x < 8 OR x = 4

and get the parsed string as a binary tree. For the above expression the expected parsed expression should look like

[['x', '>', 7], 'AND', [['x', '<', 8], 'OR', ['x', '=', 4]]]

'OR' logical operator has higher precedence than 'AND' operator. Parenthesis can override the default precedence. To be more general, the parsed expression should look like;

<left_expr> <logical_operator> <right_expr>

Another example would be

input_string = x > 7 AND x < 8 AND x = 4
parsed_expr  = [[['x', '>', 7], 'AND', ['x', ',', 8]], 'AND', ['x', '=', 4]]

So far i came up with this simple solution which sadly cannot generate parsed expression in binary tree fashion. operatorPrecedence doesn't seem to have help me here where there is same logical operator consecutively as in previous example.

import pyparsing as pp
complex_expr = pp.Forward()
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
logical = (pp.Keyword("AND") | pp.Keyword("OR")).setName("logical")
vars = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
condition = (vars + operator + vars)
clause = pp.Group(condition ^ (pp.Suppress("(") + complex_expr + pp.Suppress(")") ))

expr = pp.operatorPrecedence(clause,[
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ("AND", 2, pp.opAssoc.LEFT, ),])

complex_expr << expr
print complex_expr.parseString("x > 7 AND x < 8 AND x = 4")

Any suggestions or guidance is well appreciated.

BNF for the expression (without parenthesis) could be

<expr>       -> <expr> | <expr> <logical> <expr>
<expr>       -> <opnd> <relational> <opnd>
<opnd>       -> <variable> | <numeric>
<relational> -> <'>'> | <'='> | <'>='> | <'<='> | <'!='>
like image 620
consumer Avatar asked Jun 21 '12 07:06

consumer


2 Answers

NOTE: the operatorPrecedence method of pyparsing is deprecated in favor of the method name infixNotation.

Try changing:

expr = pp.operatorPrecedence(clause,[ 
                            ("OR", 2, pp.opAssoc.LEFT, ), 
                            ("AND", 2, pp.opAssoc.LEFT, ),]) 

to:

expr = pp.operatorPrecedence(condition,[ 
                            ("OR", 2, pp.opAssoc.LEFT, ), 
                            ("AND", 2, pp.opAssoc.LEFT, ),]) 

The first argument to operatorPrecedence is the primitive operand to be used with the operators - there is no need to include your complexExpr in parentheses - operatorPrecedence will do that for you. Since your operand is actually another deeper comparison, you might consider changing:

condition = (expr + operator + expr)

to:

condition = pp.Group(expr + operator + expr)

so that the output of operatorPrecedence is easier to process. With these changes, parsing x > 7 AND x < 8 OR x = 4 gives:

[[['x', '>', '7'], 'AND', [['x', '<', '8'], 'OR', ['x', '=', '4']]]]

which recognizes OR's higher precedence and groups it first. (Are you sure you want this order of AND and OR precedence? I think the traditional ordering is the reverse, as shown in this wikipedia entry.)

I think you are also asking why pyparsing and operatorPrecedence does not return the results in nested binary pairs, that is, you expect parsing "A and B and C" would return:

[['A', 'and', 'B'] 'and', 'C']

but what you get is:

['A', 'and', 'B', 'and', 'C']

That is because operatorPrecedence parses repeated operations at the same precedence level using repetition, not recursion. See this question which is very similar to yours, and whose answer includes a parse action to convert your repetitive parse tree to the more traditional binary parse tree. You can also find a sample boolean expression parser implemented using operatorPrecedence on the pyparsing wiki page.

EDIT: To clarify, this is what I recommend you reduce your parser to:

import pyparsing as pp

operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
number = pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
identifier = pp.Word(pp.alphas, pp.alphanums + "_")
comparison_term = identifier | number 
condition = pp.Group(comparison_term + operator + comparison_term)

expr = pp.operatorPrecedence(condition,[
                            ("AND", 2, pp.opAssoc.LEFT, ),
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ])

print expr.parseString("x > 7 AND x < 8 OR x = 4")

If support for NOT might also be something you want to add, then this would look like:

expr = pp.operatorPrecedence(condition,[
                            ("NOT", 1, pp.opAssoc.RIGHT, ),
                            ("AND", 2, pp.opAssoc.LEFT, ),
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ])

At some point, you may want to expand the definition of comparison_term with a more complete arithmetic expression, defined with its own operatorPrecedence definition. I would suggest doing it this way, rather than creating one monster opPrec definition, as you have already alluded to some of the performance downsides to opPrec. If you still get performance issues, look into ParserElement.enablePackrat.

like image 140
PaulMcG Avatar answered Nov 12 '22 10:11

PaulMcG


Let me suggest this parsing approach, coming directly from Peter Norvig's class in design of computer programs at udacity (and tweaked for your needs).

from functools import update_wrapper
from string import split
import re

def grammar(description, whitespace=r'\s*'):
    """Convert a description to a grammar.  Each line is a rule for a
    non-terminal symbol; it looks like this:
        Symbol =>  A1 A2 ... | B1 B2 ... | C1 C2 ...
    where the right-hand side is one or more alternatives, separated by
    the '|' sign.  Each alternative is a sequence of atoms, separated by
    spaces.  An atom is either a symbol on some left-hand side, or it is
    a regular expression that will be passed to re.match to match a token.

    Notation for *, +, or ? not allowed in a rule alternative (but ok
    within a token). Use '\' to continue long lines.  You must include spaces
    or tabs around '=>' and '|'. That's within the grammar description itself.
    The grammar that gets defined allows whitespace between tokens by default;
    specify '' as the second argument to grammar() to disallow this (or supply
    any regular expression to describe allowable whitespace between tokens)."""
    G = {' ': whitespace}
    description = description.replace('\t', ' ') # no tabs!
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

def decorator(d):
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def memo(f):
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key
            return f(args)
    return _f

def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'
    Also, no left recursion allowed: don't do 'E => E op T'"""

    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    @memo
    def parse_atom(atom, text):
        if atom in grammar:  # Non-Terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom]+tree, rem  
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)

MyLang = grammar("""expression => block logicalop expression | block
block => variable operator number
variable => [a-z]+
operator => <=|>=|>|<|=
number => [-+]?[0-9]+
logicalop => AND|OR""", whitespace='\s*')

def parse_it(text):
    return parse('expression', text, MyLang)

print parse_it("x > 7 AND x < 8 AND x = 4")

Outputs:

(['expression', ['block', ['variable', 'x'], ['operator', '>'], ['number', '7']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '<'], ['number', '8']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '='], ['number', '4']]]]], '')
like image 31
luke14free Avatar answered Nov 12 '22 09:11

luke14free