Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the simplest way to hand-parse left recursive grammars?

I am writing a parser for a pet project, and for educational purposes, would like to do it by hand instead of using a parser generator. Unfortunately, many online resources (and the compiler course I took at university) assume some sort of LL(k) grammar. I don't want to left factor the grammar though since left-recursive non-terminals such as

E ::= E * E
    | E / E
    | E + E
    | E - E
    | ...

that are possible in bison-like parser generators seem to greatly simplify the grammar.

What would be the simplest way to hand-parse such a grammar? My research so far tells me that recursive descent is not an option for reasons explained here, and that LR parsing using recursive ascent may be a good choice, though several sites pause to mention that it is non-intuitive.

like image 672
Matt Kline Avatar asked Oct 04 '22 10:10

Matt Kline


1 Answers

If you want to build a mostly recursive descent parser for a toy language whose only left recursion is in more-or-less standard expression syntax, then you should consider a Pratt parser (Java) (Javascript). Alternatively (and equivalently, as it turns out), you can use an operator precedence parser; the algorithm is well described in the Dragon book and there are lots of available examples using the shunting yard algorithm (but beware; many people who enthusiastically discover this algorithm and immediately write it up for their blog are far from reliable authorities.)


Note for loose parsing:

If you want to parse an expression grammar, and you don't care about operator precedence (for example, if you only need to syntax colour the code), you can easily reframe the expression grammar to avoid left-recursion.

The starting point is this, using * for the Kleene star, ? for optional, and ( ) for grouping:

expression: term (INFIX term)*
term: PREFIX* operand postfix*
operand: ID
       | CONSTANT 
       | '(' expression ')'
       ;
postfix: POSTFIX
       | '[' expression ']'
       | '(' ( expression (',' expression)* )? ')' 
       ;

Recursive descent parsers can easily deal with the * and ? operators in the above, and there is no left-recursion, so that should suffice.

Note that C has the awkwardness of cast expressions, which are syntactically indistinguishable from function calls unless you know that the first parenthesized expression contains a type instead of an expression. However, for loose parsing purposes, it's fine to parse them as though they were function calls, using the above grammar.

C++'s use of angle brackets both as operators and as template parameters is harder to deal with. Many syntax colourers I've seen just ignore the template parameter case altogether; getting them right is a challenge, but might be worthwhile.


Editorial comment:

Ignore this if you like, but I personally think that learning to work with bottom-up LR parsers, particularly GLR parsers, is a far more enriching educational experience than trying to shoehorn a language into a restricted parsing algorithm, particularly one in which the details of the grammar are obscured as code paths. But having said that, it may be valuable to do both a bison and hand-generated parser, if only to see how much more elegant and simple the bison parser can be.

like image 175
rici Avatar answered Oct 06 '22 01:10

rici