I've been learning about language parsers recently and keep seeing comparisons between Yacc and Antlr (LALR vs LL). The conclusions are always worded something like "LALR is more powerful", but I can't understand what that really means.
So could anyone please enlighten me as to what the word "powerful" means here?
I assume it means "Yacc can do something Antlr can't do"; if so, I'd like to see a concrete example of it.
The question Language theoretic comparison of LL and LR grammars has an answer with the following language that is LR(1) but not LL(*):
{ a^i b^j | i≥j }
That is, some number of a symbols followed by an equal or lesser number of b symbols.
The same language is cited by this answer to the similar question Example of an LR grammar that cannot be represented by LL?. However, the present question is different because that one says "LL", meaning LL(k), whereas here we're asking about LL(*) (and Antlr4).
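As a quick sanity check on what is and isn't in this language, here is a small Python snippet of my own (not from the linked answers) that tests membership directly:

import re

# Membership test for { a^i b^j | i >= j }: the string must be all a's
# followed by all b's, with at least as many a's as b's.
def in_language(s: str) -> bool:
    m = re.fullmatch(r'(a*)(b*)', s)
    return m is not None and len(m.group(1)) >= len(m.group(2))

assert in_language("aaabb")      # i=3, j=2: accepted
assert in_language("")           # i=j=0: accepted
assert not in_language("aabbb")  # more b's than a's: rejected
assert not in_language("ba")     # wrong order: rejected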
Let's intuitively argue that this is LR(1) but not LL(*).
First, the LR(1) grammar (copied from the second linked answer):
S ::= a S | P
P ::= a P b | <empty>
Intuitively, this is LR(1) because an LR(1) parser can push any number of a symbols onto its stack, then, when it gets to the first b, begin popping the corresponding a symbols in a,b pairs, using the first production for P. If it runs out of b symbols, it pops the remaining a symbols using the first production for S. If it runs out of a symbols while there are still b symbols left, it signals an error. (Remember, in this context, we're primarily concerned with recognition, so the output is either "yes" or "error".)
In contrast, this grammar is not LL(*). Intuitively, an LL(*) parser must decide, when it sees the first a, whether to use the first or second production of S. It would like to look ahead to see if there are as many b symbols as a symbols remaining, because if not, then it would know it has to use the first production to "burn" the excess a symbols. But LL(*) lookahead is limited to recognizing a regular language, and a regular language cannot recognize { a^i b^i }, since it cannot "count".
Of course, the fact that one grammar is not LL(*) does not imply that the language is not LL(*), because there might be a more clever grammar. To prove it is not LL(*), I would probably start with the formal definition, assume I had a grammar with those conditions, and then use a pumping lemma argument to show it can't correctly recognize the language of interest. But I'll let the linked resources suffice as rigorous justification that the language is not LL(*).
The way I think about it, LL makes decisions on the way "down" the parse tree, while LR makes them on the way "up". To make a language that is not LL(k), we arrange it so a putative parser would have to commit to an interpretation of a symbol when the information it needs to do so is beyond the horizon of k symbols. To make it not LL(*), we need to put the crucial information beyond a horizon that can only be crossed by first recognizing a non-regular language.
In contrast, LR can push symbols onto its stack, delaying their interpretation until it has seen the end of the associated production and already constructed interpretations of everything in between.
To try to make this a bit more concrete, imagine a programming language that has two kinds of brace-enclosed things, say, code blocks and object literals (like Javascript). Imagine they both can occur in the same context (unlike Javascript):
var x = { console.log("I am a code block"); /*result is*/ 6; };
var x = { a:1, b:2 };
In that context, a parser encounters {. LL must decide immediately whether that's the start of a code block or an object literal. In Javascript, object literal keys have to be identifiers or string literals, and the union of both of those is a regular language, so an LL(*) parser can skip past the regex for "identifier or stringlit" to check for :, which would signal an object literal (otherwise a code block).
{ // hmmm, code or object?
{ a // possible object literal key
{ a : // a-ha! definitely object literal
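Here is a hedged Python sketch of that bounded, regular lookahead (the names and the exact regex are mine, purely illustrative):

import re

# Regular-language lookahead: after '{', skip one identifier or string
# literal and check whether ':' follows. This is all the lookahead an
# LL(*) parser needs while keys are restricted to those two forms.
KEY = re.compile(r'\s*(?:[A-Za-z_]\w*|"[^"]*")\s*:')

def looks_like_object_literal(after_brace: str) -> bool:
    return KEY.match(after_brace) is not None

assert looks_like_object_literal(' a : 1, b : 2 }')
assert not looks_like_object_literal(' console.log("hi"); }')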
If instead a key could be an arbitrary string-typed expression, then LL(*) is in trouble, because it has to balance parentheses to get past the putative key so it can check for the ':':
{ // start of object literal?
{ ( // uh-oh ...
{ (a // I'm
{ (a ? // getting
{ (a ? b // lost
{ (a ? b : // is this the ':' after a key? help!
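To see why this lookahead is no longer regular, here is a sketch of my own in Python: finding the ':' that ends the key requires counting parentheses, and the depth counter is exactly the "counting" that a regular language cannot do for arbitrary nesting.

def find_key_colon(src: str):
    # Return the index of the ':' that ends the putative key, skipping
    # any (possibly nested) parenthesized expression; None if absent.
    depth = 0
    for i, ch in enumerate(src):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ':' and depth == 0:
            return i
    return None

assert find_key_colon('(a ? b : c) : 1 }') == 12  # skips the ':' inside parens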
In contrast, LR happily defers the interpretation of {, pushing it onto a stack, and proceeds, in effect, with two potential interpretations until some token disambiguates them.
Hopefully this provides some intuition for what sorts of things LR contains but LL(*) does not.
There are examples of the reverse (LL(*) but not LR), although I don't know what they look like ("not LR" is a hard class to think about); see the first linked question for details.
Now, the question title actually asks about Antlr4. Antlr4 has semantic predicates, effectively allowing the programmer to insert arbitrary lookahead computation. So if you're willing to step outside the grammar formalism, in fact there is no limit (short of decidability) on what an Antlr4 parser can recognize.
Here's an example grammar for which there are inputs that neither ANTLR nor YACC can parse without hackery:
stmt = declaration ';' ;
stmt = expression ';' ;
declaration = type ID ;
type = ID '*' ;
expression = product ;
product = ID ;
product = product '*' ID ;
Here's an example that neither ANTLR nor L(AL)R can parse using this grammar:
x * y ;
There are two possible parses: 1) as an expression statement, 2) as a declaration.
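Spelled out against the grammar above, the two derivations are:

stmt → expression ';' → product ';' → product '*' ID ';' → ID '*' ID ';'
stmt → declaration ';' → type ID ';' → ID '*' ID ';'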
This example is taken from the C language. (You could make a smaller grammar; I have left it in this form to make its real-world relevance easy to see.)
L(AL)R parsers and ANTLR will provide you with at most one derivation, which means each of them will miss the alternative parse.
One can hack the parsing machinery to resolve this (as GCC famously used to do) by bringing in symbol table information. That tangles parsing and symbol table construction together, which IMHO simply makes a mess of the parser. In particular, you can no longer tell what the parser accepts just by looking at the grammar.
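For illustration only, here is a toy Python sketch of that kind of hack (all names are hypothetical, and this is a caricature, not GCC's actual code): the parse decision consults a symbol table instead of the grammar alone.

# Toy version of the classic "lexer hack": whether "x * y ;" is a
# declaration or an expression depends on whether 'x' names a type.
typedef_names = {"size_t"}  # hypothetical symbol-table contents

def classify(tokens: list) -> str:
    first, second = tokens[0], tokens[1]
    if second == '*' and first in typedef_names:
        return "declaration"  # e.g. size_t * y ;
    return "expression"       # e.g. x * y ;

assert classify(["size_t", "*", "y", ";"]) == "declaration"
assert classify(["x", "*", "y", ";"]) == "expression"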
The problem is that ANTLR and L(AL)R will not parse all context-free languages. They parse only (different) subsets; one can parse certain grammars that the other cannot, and vice versa. This means neither is strictly more powerful than the other.
If you want a more powerful parser, e.g., one that handles all context-free languages, you need to look at Earley parsers, or GLR or GLL parsers. Earley parsers aren't very efficient. GLR and GLL tend to be efficient on the parts of the grammar that don't introduce ambiguities (multiple parses) or require large amounts of lookahead. But most programming language constructs you might want to parse tend not to be confusing, because people have a hard time reading them in that case, too.
Because of the limits of L(AL)R and ANTLR, my company (Semantic Designs) uses GLR parsers for some 40 languages. This includes full C++17 which has more ambiguous parses and long-lookahead cases than you can shake a stick at, including the above example, most of them a lot more obscure. As far as I know, we have the only C++17 parser that uses a grammar directly; GCC and Clang use hand-coded recursive descent with lots of tangled symbol table checks.
[The author of the other answer, Scott McPeak, did build C++ parsers for older dialects of C++ using GLR, more or less at the same time we started to do this. Hats off to him.]