 

Which contemporary computer languages are LL(1)?

(I am spending the holiday time on some language theory. Excuse me if this is a naive question.)

According to Wikipedia:

LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these grammars are easy to construct, and many computer languages are designed to be LL(1) for this reason.

So, out of curiosity, which contemporary computer languages are LL(1)? Do C, Java, C#, or Python fall into this category?

asked Jan 01 '17 by smwikipedia



1 Answer

I think I'd be tempted to flag that Wikipedia quote with [citation needed]; the assumptions are, at least, questionable. For example, there are a large number of compiler-construction tools based on yacc which make it extremely easy, in practice, to construct a parser using the more powerful (and equally fast) LALR algorithm, and some also implement the even more powerful (and, for most practical grammars, almost as fast) GLR algorithm. Hand-writing parsers has not been necessary for decades. [Note 1]

By way of attempting an answer to the question:

  1. Most computer languages are "technically" not LL because they are not even context-free. That would include languages which require identifiers to be declared (C, C++, C#, Java, etc.), as well as languages with preprocessors and/or macro facilities (C and C++, among others), languages with ambiguities which can only be resolved using semantic information (Perl would be the worst offender here, but C and C++ are also right up there). And, just to spread the joy around some more, it also includes languages which have Python-like layout awareness (Python, of course, and also Haskell).

    I put scare quotes around "technically" because there are a lot of people who will say that these exceptions "don't count". That is their opinion, and they are entitled to it (and anyway I've given up arguing about it, although I don't share that opinion). You could, for example, eliminate the preprocessor from C/C++ by preprocessing the text before applying the grammar, or preprocess whitespace-aware languages by inserting INDENT, NEWLINE and DEDENT tokens instead of the whitespace (a sketch of that tokenization appears at the end of this point), after which you could make some kind of claim about a mystical "core language". (That is much harder to apply to C++ templates, which can only be eliminated by parsing the text, so you can't claim that the expansion can be done prior to parsing.)

    The claim that a language is not context-free because it requires identifiers to be declared is perhaps a bit more controversial. In some languages (not C/C++, in which semantic analysis is required to avoid ambiguity), you could argue that a parse tree could be constructed without validating the declaration rule, which makes that rule "not syntactic". But it remains the case that you can enforce the declaration rule syntactically; you just cannot do it with a context-free grammar (you could use a Van Wijngaarden grammar, for example).

    In any case, a common parsing strategy is to recognize a superset of a language, and then reject non-conforming programs through a subsequent analysis of the resulting parse tree; in that case, the question becomes whether or not the superset is LL. In my opinion, in order for that to be meaningful, it is necessary that every valid program can be parsed to the correct syntax tree, which eliminates the usage of semantic analysis to disambiguate.
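    As an aside on the whitespace-preprocessing idea mentioned above, here is a minimal sketch in Python of how INDENT/DEDENT tokens might be synthesized before a context-free grammar is ever consulted. The function name and token shapes are my own invention, and a real tokenizer (Python's own, for example) also has to deal with tabs, line continuations, comments, and indentation that matches no open block:

        def layout_tokens(source):
            # Stack of currently open indentation levels, outermost first.
            levels = [0]
            for line in source.splitlines():
                if not line.strip():
                    continue                  # blank lines carry no layout information
                indent = len(line) - len(line.lstrip(' '))
                if indent > levels[-1]:       # deeper than before: open one block
                    levels.append(indent)
                    yield ('INDENT', indent)
                while indent < levels[-1]:    # shallower: close blocks until we match
                    levels.pop()
                    yield ('DEDENT', indent)
                yield ('LINE', line.strip())  # the line's real tokens would go here
                yield ('NEWLINE', None)       # statement terminator
            while len(levels) > 1:            # close blocks still open at end of file
                levels.pop()
                yield ('DEDENT', 0)

        for token in layout_tokens("if x:\n    y\nz"):
            print(token)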

  2. The goal of parsing is to produce a syntax tree, not merely to decide whether a text is valid. (You might miss this important fact if you skim over formal language textbooks, which tend to focus on recognition, possibly because the construction of syntax trees is less interesting.) So it seems reasonable to insist that a proposed grammar actually represent the syntactic structure of the language.

    For example, you can recognize algebraic expressions using a simple regular language:

    (PREFIX* OPERAND POSTFIX*) (INFIX PREFIX* OPERAND POSTFIX*)* 

    where PREFIX is the set of prefix operators as well as (, POSTFIX is the set of postfix operators (if any) as well as ), INFIX is the set of infix operators (addition, subtraction, multiplication, etc.), and OPERAND is an identifier or constant.

    That regular expression will not correctly reject expressions with unbalanced parentheses, but we already agreed that it was OK to recognize a superset of the language, right? :-)
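    To make that concrete, here is a small demonstration in Python. The token classes are collapsed to single-character sets of my own choosing, purely for illustration:

        import re

        PREFIX  = r"[-(]"          # prefix operators, plus '('
        POSTFIX = r"[!)]"          # postfix operators, plus ')'
        INFIX   = r"[-+*/]"        # infix operators
        OPERAND = r"[A-Za-z0-9]+"  # identifiers and constants

        UNIT = PREFIX + "*" + OPERAND + POSTFIX + "*"
        EXPR = re.compile(UNIT + "(?:" + INFIX + UNIT + ")*$")

        print(bool(EXPR.match("a+b*c")))     # True
        print(bool(EXPR.match("-(a+b)*c")))  # True
        print(bool(EXPR.match("(a+b))")))    # True: unbalanced, but in the superset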

    If desired, we could remove the parentheses from the PREFIX and POSTFIX sets and make OPERAND a recursive production. The resulting grammar is trivially LL(1) [Note 2]:

        operand    => IDENTIFIER | CONSTANT | '(' expression ')'
        term       => operand | PREFIX operand | operand POSTFIX
        expression => term | term INFIX expression

    The problem, however, is that this grammar does not capture operator precedence. It does not even attempt to recognize the fact that a+b*c and a*b+c are both sums, so that the top-level operator is + in both cases, and not whatever operator happens to come first in the expression. (If you were parsing APL, this wouldn't be an issue. But most languages conform to the usual expectations about operator precedence.)

    This point is important because an LL grammar cannot handle left-recursive productions, and you need a left-recursive production in order to correctly parse a left-associative operator. (That is, to correctly parse a-b-c as ((a-b)-c) rather than (a-(b-c)), which would have a different value.) Again, you could argue that this is a quibble, since you could post-process the parse tree in order to correct the associativity. This is certainly true, but the result is that the grammar you use to parse is different from the grammar you use to explain the syntax of the language. This might or might not bother you.
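    One common form of that workaround, sketched below in Python, is to replace the left-recursive rule with iteration (expression => term (INFIX term)* in EBNF-like notation) and fold the tree leftwards inside the parser's loop; this is precisely a case where the grammar you parse with differs from the grammar you explain with. Operands are restricted to single letters, and precedence is ignored just as in the grammar above:

        def parse_expression(tokens):
            # tokens: e.g. list('a-b-c'), consumed left to right
            tree = tokens.pop(0)             # first operand
            while tokens and tokens[0] in '+-*/':
                op = tokens.pop(0)
                right = tokens.pop(0)        # next operand
                tree = (op, tree, right)     # fold to the LEFT: ((a-b)-c)
            return tree

        print(parse_expression(list('a-b-c')))  # ('-', ('-', 'a', 'b'), 'c')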

    It's worth adding here that there are languages (Haskell, Prolog, probably many more) in which you can define operators and their precedence in the program text. That obviously makes it impossible to generate a correct syntax tree without semantic analysis, and the usual approach to parsing such languages is to use precisely the simplified "alternating operand and operator" language outlined above. Once the operator precedences are all known, presumably at the end of the parse, all expressions are then reanalyzed using something like the Shunting Yard Algorithm, in order to produce the correct parse.
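    A hedged sketch of that second phase, with an invented precedence table standing in for whatever the fixity declarations turned out to say (every operator is treated as left-associative to keep it short):

        def reparse(tokens, prec):
            # tokens alternate operand, operator, operand, ... (the flat first parse);
            # prec maps each infix operator to a binding strength which, in a
            # Haskell-like language, is only known once all fixity declarations
            # have been read.
            operands, operators = [tokens[0]], []

            def reduce_top():
                right, left = operands.pop(), operands.pop()
                operands.append((operators.pop(), left, right))

            for i in range(1, len(tokens), 2):
                op, operand = tokens[i], tokens[i + 1]
                # pop operators that bind at least as tightly (left associativity)
                while operators and prec[operators[-1]] >= prec[op]:
                    reduce_top()
                operators.append(op)
                operands.append(operand)
            while operators:
                reduce_top()
            return operands[0]

        print(reparse(['a', '+', 'b', '*', 'c'], {'+': 1, '*': 2}))
        # ('+', 'a', ('*', 'b', 'c'))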

  3. Let's suppose we discard the above objections, and just ask "for which commonly used programming languages might an LL parser be used?"

    In order to avoid cheating, though, we should really require that the LL parser have fixed lookahead and not require backtracking. If you allow arbitrary lookahead and backtracking, then you considerably expand the domain of parseable languages, but I contend that you are no longer in the realm of LL. That will eliminate, for example, the template- and preprocessor-free subset of C++, even though the common compiler implementations use recursive descent parsers, since backtracking is required in order to resolve the "Most Vexing Parse" ambiguity. (Anyway, you can't really remove templates from C++, and parsing with templates is really hard. [Note 3])

    Java and Python were both designed to be LL(1) "pseudo-parseable". I believe Haskell falls into this category as well. C cannot be correctly parsed without semantic information (classic ambiguity: is (x)*(y) a cast or a multiplication? -- it depends on whether x has been typedef'ed or declared as a variable) but I'm pretty sure it can be captured with a non-backtracking recursive-descent parser. I haven't looked at C# in depth, but this answer by Eric Lippert suggests that it requires backtracking in some cases.
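    The classic way to feed that semantic information to a non-backtracking parser is the so-called "lexer hack": the lexer consults the symbol table the parser is building, and hands back a different token kind for typedef'ed names. A toy sketch in Python (all names mine):

        typedef_names = set()   # maintained by the parser as declarations are reduced

        def classify(name):
            # The "lexer hack": the token kind depends on semantic information.
            kind = 'TYPE_NAME' if name in typedef_names else 'IDENTIFIER'
            return (kind, name)

        typedef_names.add('x')
        print(classify('x'))    # ('TYPE_NAME', 'x')   -> (x)*(y) is a cast
        typedef_names.clear()
        print(classify('x'))    # ('IDENTIFIER', 'x')  -> (x)*(y) is a multiplication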

Notes

  1. Of course, people still do it, and in many cases for good reasons (producing better error messages, for example). But "it's difficult to construct an LALR parser" is not a good reason, since it's not.

  2. That's not quite LL, because I didn't left-factor. But it's easy enough to do; I'll leave it as an exercise.

  3. See Is C++ context-free or context-sensitive? Also Todd Veldhuizen's classic C++ Templates are Turing Complete.

answered Oct 23 '22 by rici