I'm a quite beginner of menhir. I'm wondering how to parse OCaml like tuple-pattern in my own language, which is quite similar to OCaml.
For example, in the expression let a,b,c = ...
,
a, b, c
should be parsed like Tuple (Var "a", Var "b", Var "c")
.
But, in following definition of parser, the above example is parsed as Tuple (Tuple (Var "a", Var "b"), Var "c")
.
I am wondering how to fix the following definition to parse patterns like ocaml.
I have checked OCaml's parser.mly, but I'm not sure how to implement to do that. I think my definition is similar to OCaml's one ... What a magic they use ?
%token LPAREN
%token RPAREN
%token EOF
%token COMMA
%left COMMA
%token <string> LIDENT
%token UNDERBAR
%nonassoc below_COMMA
%start <Token.token> toplevel
%%
toplevel:
| p = pattern EOF { p }
pattern:
| p = simple_pattern { p }
| psec = pattern_tuple %prec below_COMMA
{ Ppat_tuple (List.rev psec) }
simple_pattern:
| UNDERBAR { Ppat_any }
| LPAREN RPAREN { Ppat_unit }
| v = lident { Ppat_var v }
| LPAREN p = pattern RPAREN { p }
pattern_tuple:
| seq = pattern_tuple; COMMA; p = pattern { p :: seq }
| p1 = pattern; COMMA; p2 = pattern { [p2; p1] }
lident:
| l = LIDENT { Pident l }
The result is the following:
[~/ocaml/error] menhir --interpret --interpret-show-cst ./parser.mly
File "./parser.mly", line 27, characters 2-42:
Warning: production pattern_tuple -> pattern_tuple COMMA pattern is never reduced.
Warning: in total, 1 productions are never reduced.
LIDENT COMMA LIDENT COMMA LIDENT
ACCEPT
[toplevel:
[pattern:
[pattern_tuple:
[pattern:
[pattern_tuple:
[pattern: [simple_pattern: [lident: LIDENT]]]
COMMA
[pattern: [simple_pattern: [lident: LIDENT]]]
]
]
COMMA
[pattern: [simple_pattern: [lident: LIDENT]]]
]
]
EOF
]
It contains a typical shift-reduce conflict, and you made a mistake at its resolution by specifying precedence. Please open any book about parsing by Yacc and check about shift-reduce conflicts and their resolution.
Let's see it using your rules. Suppose we have the following input and the parser is looking ahead at the second ,
:
( p1 , p2 , ...
↑
Yacc is looking at this second COMMA token
Their are two possibilities:
p1 , p2
as a pattern
. It uses the second rule of pattern
COMMA
and shift the look ahead cursor right to try making the comma separated list longer.You can see the conflict easily removing %prec below_COMMA
from the pattern
rule:
$ menhir z.mly # the %prec thing is removed
...
File "z.mly", line 4, characters 0-9:
Warning: the precedence level assigned to below_COMMA is never useful.
Warning: one state has shift/reduce conflicts.
Warning: one shift/reduce conflict was arbitrarily resolved.
Many Yacc documents say in this case Yacc prefers shift, and this default usually matches with human intention, including your case. So one of the solutions is simply drop %prec below_COMMA
thing and forget the warning.
If you do not like having shift reduce conflict warnings (That's the spirit!), you can explicitly state which rule should be chosen in this case using precedences, just like OCaml's parser.mly
does. (BTW, OCaml's parser.mly
is a jewel box of shift reduce resolutions. If you are not familiar, you should check one or two of them.)
Yacc chooses the rule with higher precedence at a shift reduce conflict. For shift, its precedence is the one of the token at the look ahead cursor, which is COMMA
in this case. The precedence of reduce can be declared by %prec TOKEN
suffix at the corresponding rule. If you do not specify it, I guess the precedence of a rule is undefined, and this is why the shift reduce conflict is warned if you remove %prec below_COMMA
.
So now the question is: which has a higher precedence, COMMA
or below_COMMA
? This should be declared in the preamble of the mly
file. (And this is why I have asked the questioner show that part.)
...
%left COMMA
...
%nonassoc below_COMMA
I skip what %left
and %nonassoc
mean, for all the Yacc books should explain them. Here below_COMMA
pseudo token is below COMMA
. It means below_COMMA
has a higher precedence than COMMA
. Therefore, the above example chooses Reduce, and gets ( (p1, p2), ...
against the intention.
The correct precedence declaration is the opposite. To let Shift happen, below_COMMA
must come above of COMMA
, to have a lower precedence:
...
%nonassoc below_COMMA
%left COMMA
...
See OCaml's parser.mly
. It exactly does like this. Putting "below thing above" sounds totally crazy, but it is not Menhir's fault. It is Yacc's unfortunate tradition. Blame it. OCaml's parser.mly
has a comment already:
The precedences must be listed from low to high.
This is how I would rewrite it:
%token LPAREN %token RPAREN %token EOF %token COMMA %token LIDENT %token UNDERBAR %start toplevel %% toplevel: | p = pattern EOF { p } pattern: | p = simple_pattern; tail = pattern_tuple_tail { match tail with | [] -> p | _ -> Ppat_tuple (p :: tail) } pattern_tuple_tail: | COMMA; p = simple_pattern; seq = pattern_tuple_tail { p :: seq } | { [] } simple_pattern: | UNDERBAR { Ppat_any } | LPAREN RPAREN { Ppat_unit } | v = lident { Ppat_var v } | LPAREN p = pattern RPAREN { p } lident: | l = LIDENT { Pident l }
A major change is that a tuple element is no longer a pattern
but a simple_pattern
. A pattern
itself is a comma-separated sequence of one or more elements.
$ menhir --interpret --interpret-show-cst parser.mly LIDENT COMMA LIDENT COMMA LIDENT ACCEPT [toplevel: [pattern: [simple_pattern: [lident: LIDENT]] [pattern_tuple_tail: COMMA [simple_pattern: [lident: LIDENT]] [pattern_tuple_tail: COMMA [simple_pattern: [lident: LIDENT]] [pattern_tuple_tail:] ] ] ] EOF ]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With