Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are C# and Java Grammars LALR(x)?

I wonder if C# and Java grammars are LALR(x)? If yes, what's the value of x?

Edit:

After accepting the true answer, I think it is better to change the Q in this way:

Is there any LALR(x) parser that could parse current releases of Java (version 7) or C# (version 4)? If yes, what is the value of x?

like image 375
Gupta Avatar asked Dec 04 '11 20:12

Gupta


2 Answers

You can't ask this question without first designating a specific grammar for a langauge, as some grammars may be, and some may not.

Perhaps you mean the Java grammar as published in recent Java specifications. Do you mean for Java 7?

I'm not sure you can designate a specific grammar for C#, at least not one from Microsoft, especially for C# 4.0; I don't believe they have published a grammar.

I can tell you that i don't think C# can be LALR(x), because it has some elements which look like identifiers, but can be keywords in certain contexts. This requires the lexer to know what the parser is expecting to decide if an identifier-like token is a keyword, or just and identifier. Thus there has to be feedback from the parser to lexer, or the lexer has to produce both tokens and pass them to the parser to decide which it wants. LALR parsers are defined on token streams without any feedback, and where every input token has only one interpretation.

I don't think Java is, either, from Java 1.5 and up, when enum was introduced as a special type with its own keyword. This is because, for Java 1.5 compilers to process existing Java 1.4 programs that used enum as a variable name, enum must be treated as a keyword in some contexts, and as a variable name in others. So a Java 1.5 parser has the same issues as C# does.

As a practical matter, no real langauges are LALR(1) [first edition Java may be an exception] and anybody building a real parser (esp LALR) has to make some kind of hack to get around this. (GCC famously parsed C++ with an LALR parser with an awful symbol table hack for a long time, so it could tell the difference between an identifier as a variable, and an identifier as a typedef instance. It now has some kind of hand-implemented recursive descent parser, but I think the awful hack remains). So I'm not sure the value of answer to your question.

Our C# 4.0 and Java 7 members of our family of language front ends both parse the languages using a GLR parser, extended both with the feedback capability, and the ability to process two interpretations of the same token. GLR makes the question of LALR(x) moot, and the feedback and multiple interpretations let us handle many languages that would be outside of pure GLR's capability, too.

EDIT: After a bit of thought, there might be a truly ugly way to make both grammars handle their keyword-in-context. Let's use Java's enum as an example. There realistically has to be grammar rule:

  type = 'enum' '{'  enum_members '}' ;

But we also need to allow 'enum' as an identifer. We can do that, by replacing the terminal token identifier with a nonterminal:

  identifier = IDENTIFIER | 'enum' ;

and insist that IDENTIFIERs are the terminals produced by the lexer. Now at least the lexer does not have to decide how to treat enum; the parser does. But your designated grammar would have to shaped like this in order to even have a chance of being LALR(x).

Our parsers used to do this to allow some keywords to be used sometimes as identifiers. We changed our parsing engine as described earlier, and don't do this any more.

like image 105
Ira Baxter Avatar answered Nov 14 '22 12:11

Ira Baxter


The Java grammar (version 1.0) is known to be LALR(1); this site provides a grammar and begins with the notice that

The grammar has been mechanically checked to insure that it is LALR(1).

I am not sure whether C# is LALR(1), but there is a C# parser written in bison available here, which suggests that it's probably LALR(1) (assuming that you allow for precedence declarations).

For what it's worth, typically LALR(1) is the only LALR parser used. If you need to use something like LALR(2) for a grammar, it is typically a better idea to use a LALR(1) parser with explicit precedence disambiguation, or a more powerful parser like a GLR parser.

Hope this helps!

like image 14
templatetypedef Avatar answered Nov 14 '22 11:11

templatetypedef