I've defined the following minimal Peg.js grammar:
start = "A1" / "A123"
which you can try in the sandbox.
I would have expected to match "A1" as well as "A123" (according to my notion of how backtracking works). But this is not the case: the grammar recognizes "A1" but not "A123".
Note: I am not looking for the advice "reverse the order of your terms" as in the related question How to transform a simple grammar into something which works in PEG.js (expected "a" but "a" found). Rather, I'm looking to understand the behavior I'm seeing, and why Peg.js's backtracking doesn't apply in this case. For an explanation of why reversing the order of my terms doesn't help, see the more realistic example below.
For a more realistic example, consider units parsing. A grammar should recognize metric units (like "m", "mol") with optional prefixes, like "mm", "mmol", as well as non-metric units like "yr", "week", or "mo".
The following Peg.js grammar won't recognize "mol" because it gets tripped up consuming "mo", and doesn't backtrack. (Changing the order of terms doesn't help; or rather, will cause "mo" to be recognized at the expense of "mol" or "mmol".)
start = nonmetric / metric / prefix metric
metric = "mol" / "l" / "m" / "g"
nonmetric = "yr" / "mo" / "week" / "day" / "hour"
prefix = "m" / "k" / "c"
I can do the analagous thing in Antlr with good success:
grammar units;
start : nonmetric | metric | prefix metric;
metric : 'mol' | 'l' | 'm' | 'g';
nonmetric : 'yr' | 'mo' | 'week' | 'day' | 'hour';
prefix : 'm' | 'k' | 'c';
The problem is with the concept of backtracking. PEG parsers don't backtrack like other recursive-descent parsers or Prolog do. Rather, when confronted with a choice, a PEG parser will try every option until one succeeds. Once one succeeds, it will commit to it no matter how the rule was invoked.
From the Wikipedia article:
Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking.
What you ask for in the complex case is the same that's asked in this question. The answer so far is Yes: you must tweak the rules in PEG grammars to make sure that the longest option always gets matched first, even if the result is a somewhat uglier grammar.
One way to tweak PEG grammars is to use lookaheads (that's one of the main reasons why lookaheads are featured in PEG):
start = nonmetric / metric / prefix metric
metric = "mol" / "l" / !"mo" "m" / "g"
nonmetric = "yr" / !"mol" "mo" / "week" / "day" / "hour"
prefix = !("mol"/"mo") "m" / "k" / "c"
This is by design. It's up to you to specify right order or rules that will be used for matching.
The quote from the original white paper:
These tools do not make language syntax design easy, of course. In place of having to determine whether two possible alternatives in a CFG are ambiguous, PEGs present language designers with the analogous challenge of determining whether two alternatives in a ‘/’ expression can be reordered without affecting the language. This question is often obvious, but sometimes is not, and is undecid- able in general. As with discovering ambiguity in CFGs, however, we have the hope of finding automatic algorithms to identify order sensitivity or insensitivity conservatively in common situations.
In this simple case the PEG.js could be a little smarter and recognize that rules you specified are ambiguous. Probably worth to ask the author.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With