Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python lrparsing module; cannot parse simple recursive grammar

expr = Ref('expr')
block = '{' + Repeat(expr) + '}'
expr = block | Token(re='[0-9]')
START = expr

Here's a grammar using the lrparsing module for Python. The module reports no conflicts in the grammar.

It fails to parse the string {{0}} with the error lrparsing.ParseError: line 1 column 5: Got '}' when expecting __end_of_input__ while trying to match block in state 11

The stack state step by step is:

shift  '{'; ItemSet:5='{'
shift  '{'; ItemSet:5='{' ItemSet:5='{'
shift  /[0-9]/; ItemSet:4=/[0-9]/ ItemSet:5='{' ItemSet:5='{'
reduce '}'; ItemSet:4=/[0-9]/ -- ItemSet:7=expr ItemSet:5='{' ItemSet:5='{'
reduce '}'; ItemSet:7=expr -- ItemSet:9=expr ItemSet:5='{' ItemSet:5='{'
shift  '}'; ItemSet:11='}' ItemSet:9=expr ItemSet:5='{' ItemSet:5='{'

Which afaik means it's shifting {{0, then upon seeing } reducing 0 to expr, then reducing on } again without having shifted it first, which confuses the bajeezus out of me.

Is this a bug, or am I doing something infinitely simple and stupid?

If it's my grammar, how would I refactor it to satisfy my eternally steamy and passionate desires? If it's a bug, could someone direct me to a python module that has syntax most similar to lrparsing that does work?

EDIT: Refactoring as:

blocks = Ref('blocks')
block = Ref('block')
expr = Ref('expr')
blocks = blocks + block | THIS*0 # THIS*0 is the idiomatic way of getting the empty string
block = '{' + expr + '}'
expr = blocks | Token(re='[0-9]')
START = expr

Allows for proper parsing. The question for me now, is... why? I feel like lrparsing would've complained to me earlier about any parsing problems... Does Repeat simply not work the way I expect it to?

like image 681
user Avatar asked Dec 04 '25 22:12

user


1 Answers

Lrparsing author here. As Serge said it was a bug, and is fixed in 1.0.8. This only happened because Serge reported it on Source Forge's but tracker - otherwise I would not have known. Thank you Serge.

The comments about it possibly being a bug in Repeat() hints at not understanding what lrparsing does. Lrparsing is a rather complex beast. It lets you enter grammar in a way I hope is natural for a Python programmer. It then compiles into something a LR(1) parser generator can understand, which is a series of productions. Then it generates an LR(1) parsing table from those productions. And finally it feeds your input language and the parsing table to an LR(1) parser to generate the parse tree. For what it's worth, the bug was in the part that generates the parsing table.

Debugging such a series of transformations would be near impossible for me I if I could not see what each step produces. Accordingly lrparsing has a repr_xxxx() function that displays the output of each step. The first transformation is parsing your grammar. The result is displayed by repr_grammar():

<G> = START + __end_of_input__
START = expr
block = '{' + expr * () + '}'
expr = block | /[0-9]/

Which looks very similar to the original grammar presented in the question. The next step is to compile those rules in productions, which is what an LR(1) parser generator can understand. These are printed by repr_productions():

<G> = START __end_of_input__
START = expr
block = '{' '}'
block = '{' block.Sequence.Repeat '}'
block.Sequence.Repeat = expr
block.Sequence.Repeat = block.Sequence.Repeat expr
expr = block
expr = /[0-9]/

The block.Sequence.Repeat is a new Nonterminal lrparsing introduced in order to handle the Repeat(). Those productions look like a faithful representation of the original grammar to me.

Lrparsing goes out of it's way to hide the nonterminals it introduces like block.Sequence.Repeat. For example they won't appear in the output parse tree. That means there is no need for an lrparsing user to care about them - except for 2 cases. Those 2 cases are error recovery and trying to understand the log output of the parse engine. The former is a complex technique most won't attempt. But some here looked at the latter in order to try and understand what lrparsing was doing. The log won't make much sense unless you can see the productions the LR(1) parser is trying to recognise. But if you had seen them, you would have known there wasn't a bug in Repeat().

You can also dump the generated LR(1) parse table. If you really want to understand how a LR(1) parser works, that is what you should be trying to grok. Unless you happen to find parsing a deeply interesting topic I don't recommend it.

like image 150
Russell Stuart Avatar answered Dec 06 '25 11:12

Russell Stuart



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!