What are some good tools for getting a quick start for parsing and analyzing C/C++ code?
In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.
Here's what I've found so far, but haven't looked at them in detail (thoughts?):
I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.
Thanks! -Matt
(Added 2/9): Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42", but remain attached to the name "foo". This, unfortunately, excludes several solutions that run the preprocessor first and only deliver the C/C++ parse tree)
C is a bit hard to parse because statements like `A * B();` will mean different things if A is defined as a type or note. C++ is much harder to parse because the template syntax is hard to disambiguate from less than or greater than.
C is (mostly) parseable with an LALR(1) grammar, although you need to implement some version of the "lexer hack" in order to correctly parse cast expressions.
GCC used to have YACC based parser, but this was replaced with a handwritten recursive descent parser, later on.
Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:
Outstandingly complicated grammar
"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is
AA BB(CC);
an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.
You can look at clang that uses llvm for parsing.
Support C++ fully now link
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With