Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the algorithm for parsing expressions in infix notation?

I would like to parse boolean expressions in PHP. As in:

A and B or C and (D or F or not G)

The terms can be considered simple identifiers. They will have a little structure, but the parser doesn't need to worry about that. It should just recognize the keywords and or not ( ). Everything else is a term.

I remember we wrote simple arithmetic expression evaluators at school, but I don't remember how it was done anymore. Nor do I know what keywords to look for in Google/SO.

A ready made library would be nice, but as I remember the algorithm was pretty simple so it might be fun and educational to re-implement it myself.

like image 493
Vilx- Avatar asked Jan 19 '10 11:01

Vilx-


1 Answers

Recursive descent parsers are fun to write and easy to read. The first step is to write your grammar out.

Maybe this is the grammar you want.

expr        = and_expr ('or' and_expr)*
and_expr    = not_expr ('and' not_expr)*
not_expr    = simple_expr | 'not' not_expr
simple_expr = term | '(' expr ')'

Turning this into a recursive descent parser is super easy. Just write one function per nonterminal.

def expr():
    x = and_expr()
    while peek() == 'or':
        consume('or')
        y = and_expr()
        x = OR(x, y)
    return x

def and_expr():
    x = not_expr()
    while peek() == 'and':
        consume('and')
        y = not_expr()
        x = AND(x, y)
    return x

def not_expr():
    if peek() == 'not':
        consume('not')
        x = not_expr()
        return NOT(x)
    else:
        return simple_expr()

def simple_expr():
    t = peek()
    if t == '(':
        consume('(')
        result = expr()
        consume(')')
        return result
    elif is_term(t):
        consume(t)
        return TERM(t)
    else:
        raise SyntaxError("expected term or (")

This isn't complete. You have to provide a little more code:

  • Input functions. consume, peek, and is_term are functions you provide. They'll be easy to implement using regular expressions. consume(s) reads the next token of input and throws an error if it doesn't match s. peek() simply returns a peek at the next token without consuming it. is_term(s) returns true if s is a term.

  • Output functions. OR, AND, NOT, and TERM are called each time a piece of the expression is successfully parsed. They can do whatever you want.

  • Wrapper function. Instead of just calling expr directly, you'll want to write a little wrapper function that initializes the variables used by consume and peek, then calls expr, and finally checks to make sure there's no leftover input that didn't get consumed.

Even with all this, it's still a tiny amount of code. In Python, the complete program is 84 lines, and that includes a few tests.

like image 120
Jason Orendorff Avatar answered Oct 01 '22 13:10

Jason Orendorff