What makes Java easier to parse than C?

Tags: java, c, parsing, grammar

I'm acquainted with the fact that the grammars of C and C++ are context-sensitive, and in particular you need a "lexer hack" in C. On the other hand, I'm under the impression that you can parse Java with only 2 tokens of look-ahead, despite considerable similarity between the two languages.

What would you have to change about C to make it more tractable to parse?

I ask because all of the examples I've seen of C's context-sensitivity are technically allowable but awfully weird. For example,

foo (a);

could be calling the void function foo with argument a. Or, it could be declaring a to be an object of type foo, but you could just as easily get rid of the parentheses. In part, this weirdness occurs because the "direct declarator" production rule for the C grammar fulfills the dual purpose of declaring both functions and variables.
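To make the two readings concrete, here is a small sketch in C. The function names read_as_call and read_as_declaration are made up for illustration; the only point is that the same three tokens parse differently depending on what foo means in the enclosing scope.

```c
#include <assert.h>

/* Reading 1: foo is a function, so `foo (a);` is a call statement. */
static int total = 0;
static void foo(int x) { total += x; }

static int read_as_call(void) {
    int a = 3;
    foo (a);             /* expression statement: calls foo with a */
    return total;
}

/* Reading 2: foo names a type in this scope, so the identical tokens
   `foo (a);` declare a variable a; the parentheses are redundant. */
static int read_as_declaration(void) {
    typedef int foo;     /* hides the function foo inside this block */
    foo (a);             /* declaration: same as `foo a;` */
    assert(sizeof a == sizeof(int));  /* a really is an int object */
    a = 7;
    return a;
}
```

Both functions compile as standard C, which is exactly why a parser cannot classify the statement from the tokens alone.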

On the other hand, the Java grammar has separate production rules for variable declaration and function declaration. If you write

foo a;

then you know it's a variable declaration and foo can unambiguously be parsed as a typename. This might not be valid code if the class foo hasn't been defined somewhere in the current scope, but that's a job for semantic analysis that can be performed in a later compiler pass.

I've seen it said that C is hard to parse because of typedef, but you can declare your own types in Java too. Which C grammar rules, besides direct_declarator, are at fault?
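For reference, the "lexer hack" mentioned above can be sketched roughly like this. It is a toy under stated assumptions: the names declare_typedef and classify_identifier and the flat, unscoped table are hypothetical, and a real C front end would track typedef names per scope.

```c
#include <assert.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

/* Hypothetical typedef table; a real front end keeps one per scope. */
#define MAX_TYPEDEFS 16
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called by the parser when it finishes a typedef declaration. */
static void declare_typedef(const char *name) {
    assert(typedef_count < MAX_TYPEDEFS);
    typedef_names[typedef_count++] = name;
}

/* Called by the lexer for every identifier it scans: the token kind
   depends on what the parser has already seen, not just on the text. */
static enum token_kind classify_identifier(const char *name) {
    for (int i = 0; i < typedef_count; i++) {
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE_NAME;
    }
    return TOK_IDENTIFIER;
}
```

The feedback loop from parser to lexer is what makes this a hack: the same spelling yields a different token before and after the typedef is seen.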

425

asked Oct 12 '14 21:10

Daniel Shapero

1 Answer

Parsing C++ is getting hard. Parsing Java is getting to be just as hard.

See this SO answer discussing why C (and C++) is "hard" to parse. The short summary is that C and C++ grammars are inherently ambiguous; they will give you multiple parses and you must use context to resolve the ambiguities. People then make the mistake of assuming you have to resolve ambiguities as you parse; not so, see below. If you insist on resolving ambiguities as you parse, your parser gets more complicated and that much harder to build; but that complexity is a self-inflicted wound.

IIRC, Java 1.4's "obvious" LALR(1) grammar was not ambiguous, so it was "easy" to parse. I'm not so sure that modern Java hasn't got at least long-distance local ambiguities; there's always the problem of deciding whether "...>>" closes off two generic type argument lists or is a right-shift operator. I suspect modern Java does not parse with LALR(1) anymore.
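One common workaround for the ">>" problem is to let the lexer keep its longest-match rule and have the parser peel off a single '>' when it is expecting the end of a type argument list. The sketch below is written in C purely for illustration; struct token and accept_closing_angle are hypothetical names, and this is not claimed to be how any particular Java compiler does it.

```c
#include <assert.h>
#include <string.h>

/* Minimal token state: the text of the current token, e.g. ">", ">>" or ">>>". */
struct token { char text[4]; };

/* Called by a parser that needs one '>' to close a type argument list.
   If the lexer produced ">>" or ">>>" by longest match, peel one '>' off
   and leave the rest as the current token (an empty text means the caller
   should advance to the next token). Returns 1 on success, 0 if the
   current token cannot close the list. */
static int accept_closing_angle(struct token *tok) {
    size_t len = strlen(tok->text);
    if (len == 0 || strspn(tok->text, ">") != len)
        return 0;                            /* not made only of '>' characters */
    memmove(tok->text, tok->text + 1, len);  /* drop first '>', keep the NUL */
    return 1;
}
```

The decision is driven by parser context, so the grammar stays unambiguous at the cost of a token that can change under the parser's feet.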

But one can get past the parsing problem by using strong parsers (or weak parsers and context collection hacks as C and C++ front ends mostly do now), for both languages. C and C++ have the additional complication of having a preprocessor; preprocessors are more complicated in practice than they look. One claim is that the C and C++ parsers are so hard they have to be written by hand. It isn't true; you can build Java and C++ parsers just fine with GLR parser generators.
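A rough picture of "parse first, disambiguate later": the parser records that a statement has two readings and a later pass picks one using name information. Everything here is a hypothetical sketch (struct ambiguous_stmt, resolve, and the stand-in lookup only_foo_is_type are invented for illustration), not the data structure of any specific GLR tool.

```c
#include <assert.h>
#include <string.h>

enum reading { READ_DECLARATION, READ_CALL };

/* Hypothetical ambiguity node for `foo (a);`: the parser keeps both
   readings open and leaves the choice to a later pass. */
struct ambiguous_stmt {
    const char *head;        /* the identifier in front, e.g. "foo" */
    int resolved;            /* 0 until a later pass picks a reading */
    enum reading chosen;     /* meaningful only once resolved is 1 */
};

/* Stand-in for real name lookup: pretend only "foo" names a type. */
static int only_foo_is_type(const char *name) {
    return strcmp(name, "foo") == 0;
}

/* Later pass: is_type_name is whatever name lookup the front end has. */
static void resolve(struct ambiguous_stmt *s,
                    int (*is_type_name)(const char *)) {
    s->chosen = is_type_name(s->head) ? READ_DECLARATION : READ_CALL;
    s->resolved = 1;
}
```

The parser itself never consults a symbol table in this arrangement; the context sensitivity moves into the pass that already needs the symbol information anyway.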

But parsing isn't really where the problem is.

Once you parse, you will want to do something with the AST/parse tree. In practice, you need to know, for every identifier, what its definition is and where it is used ("name and type resolution", sloppily, building symbol tables). This turns out to be a LOT more work than getting the parser right, compounded by inheritance, interfaces, overloading and templates, and then confounded by the fact that the semantics for all this are written in informal natural language spread across tens to hundreds of pages of the language standard. C++ is really bad here. Java 7 and 8 are getting to be pretty awful from this point of view. (And symbol tables aren't all you need; see my bio for a longer essay on "Life After Parsing").
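For a sense of the simplest possible starting point, here is a minimal scoped symbol table sketch in C. The names struct scope, define and lookup are hypothetical, and the sketch deliberately ignores everything that makes the real job hard (inheritance, interfaces, overloading, templates); it only shows nested scopes with inner definitions hiding outer ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 32

struct symbol { const char *name; const char *type; };

/* One scope: its own symbols plus a link to the enclosing scope. */
struct scope {
    struct scope *parent;
    struct symbol symbols[MAX_SYMBOLS];
    int count;
};

/* Record a definition in the given scope. */
static void define(struct scope *sc, const char *name, const char *type) {
    assert(sc->count < MAX_SYMBOLS);
    sc->symbols[sc->count].name = name;
    sc->symbols[sc->count].type = type;
    sc->count++;
}

/* Innermost definition wins; returns NULL if the name is undeclared. */
static const struct symbol *lookup(const struct scope *sc, const char *name) {
    for (; sc != NULL; sc = sc->parent) {
        for (int i = sc->count - 1; i >= 0; i--) {
            if (strcmp(sc->symbols[i].name, name) == 0)
                return &sc->symbols[i];
        }
    }
    return NULL;
}
```

A production resolver layers the language's lookup rules on top of something like this, and that layering is where most of the pages of the standard come in.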

Most folks struggle with the pure parsing part (often never finishing; check SO itself for the many, many questions about how to build working parsers for real languages), so they don't ever see life after parsing. And then we get folk theorems about what is hard to parse and no signal about what happens after that stage.

Fixing C++ syntax won't get you anywhere.

Regarding changing the C++ syntax: you'll find you need to patch a lot of places to take care of the variety of local and real ambiguities in any C++ grammar. If you insist, the following list might be a good starting place. I contend there is no point in doing this if you are not the C++ standards committee; if you did so, and built a compiler using that, nobody sane would use it. There's too much invested in existing C++ applications to switch for convenience of the guys building parsers; besides, their pain is over and existing parsers work fine.

You may want to write your own parser. OK, that's fine; just don't expect the rest of the community to let you change the language they must use to make it easier for you. They all want it easier for them, and that's to use the language as documented and implemented.

170

answered Sep 26 '22 11:09

Ira Baxter
