Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hand-written top-down parser that preserves grammar associativity/precedence?

Tags:

parsing

Suppose I have a simple grammar like:

X -> number
T -> X
T -> T + X

So for example 3 + 4 + 5 would parse as:

     +
    / \ 
   +   5
 /  \
3    4

This has the left-right associativity of + "built into" the grammar.

It is trivially LR(1), however suppose I want to do a hand-written top-down parse of it.

I cannot do so because it is left recursive, so lets left factor it:

X  -> number
T  -> X T'
T' -> + X T'
T' -> e    // empty

If I now write a parser for it (psuedo-code):

parse_X:
    if lookahead is number
        return pop_lookahead

parse_T:
    return (parse_X, parse_T')

parse_T':
    if lookahead is +
         pop_lookahead
         return (parse_X, parse_T')
    else
         return ();

Then when I call parse_T on an input of 3 + 4 + 5 I get returned a trace like:

parse_T
(parse_X, parse_T')
(3, parse_T')
(3, (parse_X, parse_T'))
(3, (4, parse_T'))
(3, (4, (parse_X, parse_T')))
(3, (4, (5, ())))

See how the parse is "backwards". A tree constructed "naively" from such a parse looks like:

     +
    / \ 
   3   +
      / \
     4    5

Which has the wrong associativity.

Can anyone clear this up? In general how can I write a hand-written top-down parser that preserves the associativity built into the grammar?

like image 345
Andrew Tomazos Avatar asked Dec 07 '25 09:12

Andrew Tomazos


1 Answers

One strategy is to replace recursion over T with iteration over X (in general terms, replace recursion over an operator with iteration over the next-highest-precedence operator). It helps to use EBNF-style notation, in this case

T -> X {+ X}

because then the needed iteration becomes obvious:

parse_T:
  val = parse_X
  while lookahead is +
     pop_lookahead
     val = PLUS(val, parse_X)
  return val

where PLUS() represents whatever you do to evaluate an addition expression, e.g. construct a tree node.

If you apply this to all operators, you wind up with essentially one function corresponding to EXPRESSION which only recurses when dealing with

PRIMARY -> '(' EXPRESSION ')'

Such an approach leads to a fairly fast expression parser; one of the common objections to using naive recursive descent to parse expressions is that if you have several levels of operator precedence, you can easily need 20 or so nested function calls to parse each PRIMARY, which can get fairly slow. With the iterative approach, it generally takes only one function call per PRIMARY. If you have some right-associative operators (e.g. exponentiation), you can use a recursive approach for them.

like image 85
ebohlman Avatar answered Dec 11 '25 23:12

ebohlman



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!