I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3
. Some shift/reduce operations are as follows:
Stack | Input | Operation
---------+-------+----------------------------------------------
| 1+2*3 |
NUMBER | +2*3 | Shift
A | +2*3 | Reduce (looking ahead, we know to stop at A)
A+ | 2*3 | Shift
A+NUMBER | *3 | Shift (looking ahead, we know to stop at M)
A+M | *3 | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M
to A
, as doing so would result in an inevitable syntax error after reading *3
?
Rule 1. If there is a shift-reduce conflict in situations where no precedence rules have been created to resolve the conflict, the default action is to shift. The conflict is also reported in the yacc output so you can check that shifting is actually what you want.
The Shift-Reduce Conflict is the most common type of conflict found in grammars. It is caused when the grammar allows a rule to be reduced for particular token, but, at the same time, allowing another rule to be shifted for that same token.
Since choices are made only at the end of a rule, shared prefixes are unproblematic. Because of this, there is usually no need to modify grammar rules. The parser can be generated automatically. One big disadvantage is the fact that bottom-up parsing does not suppport left/right information flow.
A reduce/reduce conflict occurs if there are two or more rules that apply to the same sequence of input. This usually indicates a serious error in the grammar. For example, here is an erroneous attempt to define a sequence of zero or more word groupings.
The problem you're describing is an issue with creating LR(0)
parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0)
grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1)
, however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1)
parser would look ahead when it had the 1
on the stack, see that the next symbol is a +
, and realize that it shouldn't reduce past A
(since that is the only thing it could reduce to that would still match a rule with +
in the second position).
An interesting property of LR
grammars is that for any grammar which is LR(k)
for k>1
, it is possible to construct an LR(1)
grammar which is equivalent. However, the same does not extend all the way down to LR(0)
- there are many grammars which cannot be converted to LR(0)
.
See here for more details on LR(k)
-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With