I am trying to parse complex logical expression like the one below;
x > 7 AND x < 8 OR x = 4
and get the parsed string as a binary tree. For the above expression the expected parsed expression should look like
[['x', '>', 7], 'AND', [['x', '<', 8], 'OR', ['x', '=', 4]]]
'OR' logical operator has higher precedence than 'AND' operator. Parenthesis can override the default precedence. To be more general, the parsed expression should look like;
<left_expr> <logical_operator> <right_expr>
Another example would be
input_string = x > 7 AND x < 8 AND x = 4
parsed_expr = [[['x', '>', 7], 'AND', ['x', ',', 8]], 'AND', ['x', '=', 4]]
So far i came up with this simple solution which sadly cannot generate parsed expression in binary tree fashion. operatorPrecedence doesn't seem to have help me here where there is same logical operator consecutively as in previous example.
import pyparsing as pp
complex_expr = pp.Forward()
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
logical = (pp.Keyword("AND") | pp.Keyword("OR")).setName("logical")
vars = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
condition = (vars + operator + vars)
clause = pp.Group(condition ^ (pp.Suppress("(") + complex_expr + pp.Suppress(")") ))
expr = pp.operatorPrecedence(clause,[
("OR", 2, pp.opAssoc.LEFT, ),
("AND", 2, pp.opAssoc.LEFT, ),])
complex_expr << expr
print complex_expr.parseString("x > 7 AND x < 8 AND x = 4")
Any suggestions or guidance is well appreciated.
BNF
for the expression (without parenthesis) could be
<expr> -> <expr> | <expr> <logical> <expr>
<expr> -> <opnd> <relational> <opnd>
<opnd> -> <variable> | <numeric>
<relational> -> <'>'> | <'='> | <'>='> | <'<='> | <'!='>
NOTE: the operatorPrecedence
method of pyparsing is deprecated in favor of
the method name infixNotation
.
Try changing:
expr = pp.operatorPrecedence(clause,[
("OR", 2, pp.opAssoc.LEFT, ),
("AND", 2, pp.opAssoc.LEFT, ),])
to:
expr = pp.operatorPrecedence(condition,[
("OR", 2, pp.opAssoc.LEFT, ),
("AND", 2, pp.opAssoc.LEFT, ),])
The first argument to operatorPrecedence is the primitive operand to be used with the operators - there is no need to include your complexExpr in parentheses - operatorPrecedence will do that for you. Since your operand is actually another deeper comparison, you might consider changing:
condition = (expr + operator + expr)
to:
condition = pp.Group(expr + operator + expr)
so that the output of operatorPrecedence is easier to process. With these changes, parsing x > 7 AND x < 8 OR x = 4
gives:
[[['x', '>', '7'], 'AND', [['x', '<', '8'], 'OR', ['x', '=', '4']]]]
which recognizes OR's higher precedence and groups it first. (Are you sure you want this order of AND and OR precedence? I think the traditional ordering is the reverse, as shown in this wikipedia entry.)
I think you are also asking why pyparsing and operatorPrecedence does not return the results in nested binary pairs, that is, you expect parsing "A and B and C" would return:
[['A', 'and', 'B'] 'and', 'C']
but what you get is:
['A', 'and', 'B', 'and', 'C']
That is because operatorPrecedence parses repeated operations at the same precedence level using repetition, not recursion. See this question which is very similar to yours, and whose answer includes a parse action to convert your repetitive parse tree to the more traditional binary parse tree. You can also find a sample boolean expression parser implemented using operatorPrecedence on the pyparsing wiki page.
EDIT: To clarify, this is what I recommend you reduce your parser to:
import pyparsing as pp
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
number = pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
identifier = pp.Word(pp.alphas, pp.alphanums + "_")
comparison_term = identifier | number
condition = pp.Group(comparison_term + operator + comparison_term)
expr = pp.operatorPrecedence(condition,[
("AND", 2, pp.opAssoc.LEFT, ),
("OR", 2, pp.opAssoc.LEFT, ),
])
print expr.parseString("x > 7 AND x < 8 OR x = 4")
If support for NOT might also be something you want to add, then this would look like:
expr = pp.operatorPrecedence(condition,[
("NOT", 1, pp.opAssoc.RIGHT, ),
("AND", 2, pp.opAssoc.LEFT, ),
("OR", 2, pp.opAssoc.LEFT, ),
])
At some point, you may want to expand the definition of comparison_term
with a more complete arithmetic expression, defined with its own operatorPrecedence
definition. I would suggest doing it this way, rather than creating one monster opPrec
definition, as you have already alluded to some of the performance downsides to opPrec
. If you still get performance issues, look into ParserElement.enablePackrat
.
Let me suggest this parsing approach, coming directly from Peter Norvig's class in design of computer programs at udacity (and tweaked for your needs).
from functools import update_wrapper
from string import split
import re
def grammar(description, whitespace=r'\s*'):
"""Convert a description to a grammar. Each line is a rule for a
non-terminal symbol; it looks like this:
Symbol => A1 A2 ... | B1 B2 ... | C1 C2 ...
where the right-hand side is one or more alternatives, separated by
the '|' sign. Each alternative is a sequence of atoms, separated by
spaces. An atom is either a symbol on some left-hand side, or it is
a regular expression that will be passed to re.match to match a token.
Notation for *, +, or ? not allowed in a rule alternative (but ok
within a token). Use '\' to continue long lines. You must include spaces
or tabs around '=>' and '|'. That's within the grammar description itself.
The grammar that gets defined allows whitespace between tokens by default;
specify '' as the second argument to grammar() to disallow this (or supply
any regular expression to describe allowable whitespace between tokens)."""
G = {' ': whitespace}
description = description.replace('\t', ' ') # no tabs!
for line in split(description, '\n'):
lhs, rhs = split(line, ' => ', 1)
alternatives = split(rhs, ' | ')
G[lhs] = tuple(map(split, alternatives))
return G
def decorator(d):
def _d(fn):
return update_wrapper(d(fn), fn)
update_wrapper(_d, d)
return _d
@decorator
def memo(f):
cache = {}
def _f(*args):
try:
return cache[args]
except KeyError:
cache[args] = result = f(*args)
return result
except TypeError:
# some element of args can't be a dict key
return f(args)
return _f
def parse(start_symbol, text, grammar):
"""Example call: parse('Exp', '3*x + b', G).
Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
string. Failure iff remainder is None. This is a deterministic PEG parser,
so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
longest parse first; don't do 'E => T | T op E'
Also, no left recursion allowed: don't do 'E => E op T'"""
tokenizer = grammar[' '] + '(%s)'
def parse_sequence(sequence, text):
result = []
for atom in sequence:
tree, text = parse_atom(atom, text)
if text is None: return Fail
result.append(tree)
return result, text
@memo
def parse_atom(atom, text):
if atom in grammar: # Non-Terminal: tuple of alternatives
for alternative in grammar[atom]:
tree, rem = parse_sequence(alternative, text)
if rem is not None: return [atom]+tree, rem
return Fail
else: # Terminal: match characters against start of text
m = re.match(tokenizer % atom, text)
return Fail if (not m) else (m.group(1), text[m.end():])
# Body of parse:
return parse_atom(start_symbol, text)
Fail = (None, None)
MyLang = grammar("""expression => block logicalop expression | block
block => variable operator number
variable => [a-z]+
operator => <=|>=|>|<|=
number => [-+]?[0-9]+
logicalop => AND|OR""", whitespace='\s*')
def parse_it(text):
return parse('expression', text, MyLang)
print parse_it("x > 7 AND x < 8 AND x = 4")
Outputs:
(['expression', ['block', ['variable', 'x'], ['operator', '>'], ['number', '7']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '<'], ['number', '8']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '='], ['number', '4']]]]], '')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With