Good tools for creating a C/C++ parser/analyzer [closed]

Tags: c++, c, parsing, yacc, lex

What are some good tools for getting a quick start for parsing and analyzing C/C++ code?

In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.

Here's what I've found so far, though I haven't looked at any of them in detail (thoughts?):

  • CScope - Old-school C analyzer. Doesn't seem to do a full parse, though. Described as a glorified 'grep' for finding C functions.
  • GCC - Everybody's favorite open source compiler. Very complicated, but seems to do it all. There's a related project for creating GCC extensions called GEM, but it hasn't been updated since GCC 4.1 (2006).
  • PUMA - The PUre MAnipulator. (from the page: "The intention of this project is to provide a library of classes for the analysis and manipulation of C/C++ sources. For this purpose PUMA provides classes for scanning, parsing and of course manipulating C/C++ sources."). This looks promising, but hasn't been updated since 2001. Apparently PUMA has been incorporated into AspectC++, but even this project hasn't been updated since 2006.
  • Various C/C++ raw grammars. You can get c-c++-grammars-1.2.tar.gz, but this has been unmaintained since 1997. A little Google searching pulls up other basic lex/yacc grammars that could serve as a starting place.
  • Any others?

I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.

Thanks! -Matt

(Added 2/9) Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42"; it should remain attached to the name "foo". Unfortunately, this excludes several solutions that run the preprocessor first and deliver only the C/C++ parse tree.
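
One way to keep that association is to record macros in a side table instead of expanding them away. The following is only a minimal sketch of the idea, not a real preprocessor: `MacroTable` and `collect_defines` are hypothetical names, and the code assumes one `#define` per line with no line continuations, no space between `#` and `define`, and no function-like macros.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Hypothetical side table: macro name -> replacement text, kept unexpanded.
using MacroTable = std::map<std::string, std::string>;

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Scan source text line by line and record simple object-like #defines.
// Assumption: one directive per line, no continuations.
MacroTable collect_defines(const std::string& source) {
    MacroTable table;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string directive, name;
        if (!(ls >> directive) || directive != "#define") continue;
        if (!(ls >> name)) continue;
        // A '(' directly attached to the name marks a function-like macro: skip it.
        if (name.find('(') != std::string::npos) continue;
        std::string rest;
        std::getline(ls, rest);  // everything after the name is the replacement text
        table[name] = trim(rest);
    }
    return table;
}
```

A translator built this way could look up "foo" in the table and emit a named constant in the target language rather than the bare literal.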

asked Feb 09 '09 00:02 by Matt Ball


People also ask

Is C hard to parse?

C is a bit hard to parse because a statement like `A * B();` means different things depending on whether A is defined as a type or not. C++ is much harder to parse because the template angle brackets are hard to disambiguate from the less-than and greater-than operators.
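
A minimal C++ sketch of that ambiguity, with the two readings placed in separate namespaces so that both compile side by side. The names `a_is_type`, `a_is_value` and `product` are made up for illustration.

```cpp
#include <type_traits>

namespace a_is_type {
    struct A {};
    A * B();  // declaration: B is a function taking no arguments, returning A*
}

namespace a_is_value {
    int A = 6;
    int B() { return 7; }
    // Same token sequence `A * B()`, now an expression: A multiplied by the result of B().
    int product() { return A * B(); }
}

// In the first reading, calling B() really does yield a pointer to A.
static_assert(std::is_same<decltype(a_is_type::B()), a_is_type::A*>::value,
              "B() returns A* when A names a type");
```

The parser cannot pick between the two readings from the tokens alone; it has to know what `A` was declared as earlier.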

Which parser is used in C?

C is (mostly) parseable with an LALR(1) grammar, although you need to implement some version of the "lexer hack" in order to correctly parse cast expressions.
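
The "lexer hack" usually means letting the lexer consult the parser's symbol table, so that an identifier already declared with `typedef` comes back as a different token kind. With that, `(A)*b` can be parsed as a cast when `A` is a type and as a multiplication otherwise. Below is a rough sketch of the lookup only; `TypedefTable`, `declare_typedef` and `classify` are hypothetical names, and scoping (a local variable shadowing a typedef) is deliberately ignored.

```cpp
#include <set>
#include <string>

enum class TokenKind { Identifier, TypeName };

// Hypothetical symbol table shared between the parser and the lexer.
class TypedefTable {
public:
    // Called by the parser once it has reduced a `typedef ... name;` declaration.
    void declare_typedef(const std::string& name) { names_.insert(name); }

    bool is_typedef(const std::string& name) const { return names_.count(name) != 0; }

private:
    std::set<std::string> names_;
};

// The hack itself: the lexer asks the table how to classify an identifier.
TokenKind classify(const std::string& ident, const TypedefTable& table) {
    return table.is_typedef(ident) ? TokenKind::TypeName : TokenKind::Identifier;
}
```

In a flex/bison setup the equivalent check would sit in the identifier rule of the scanner, returning one of two token codes to the grammar.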

Which parser is used in GCC compiler?

GCC used to have a YACC-based parser, but it was later replaced with a handwritten recursive descent parser.


2 Answers

Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:

Outstandingly complicated grammar

"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is AA BB(CC); an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.
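
To make the quoted example concrete, here is a small sketch in which the same line `AA BB(CC);` compiles both ways, depending only on what `CC` names. The namespaces `cc_is_value` and `cc_is_type` and the member `v` are invented for illustration.

```cpp
#include <type_traits>

namespace cc_is_value {
    struct AA {
        int v;
        explicit AA(int x) : v(x) {}
    };
    int CC = 5;
    AA BB(CC);  // object definition: BB is an AA initialized from the variable CC
}

namespace cc_is_type {
    struct AA {};
    struct CC {};
    AA BB(CC);  // function declaration: BB takes a CC and returns an AA
}

// The compiler's view of each BB confirms the two readings.
static_assert(!std::is_function<decltype(cc_is_value::BB)>::value, "BB is an object here");
static_assert(std::is_function<decltype(cc_is_type::BB)>::value, "BB is a function here");
```

A tool that only has a grammar and no record of earlier declarations has no way to tell these two cases apart.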

answered Sep 23 '22 08:09 by Adam Rosenfield


You can look at Clang, the C/C++ front end of the LLVM project.

It supports C++ fully now.

answered Sep 26 '22 08:09 by epatel