I am trying to write a program to check that some C source code conforms to a variable naming convention. In order to do this, I need to analyse the source code and identify the type of all the local and global variables.
The end result will almost certainly be a Python program, but the tool to analyse the code could be either a Python module or an application that produces an easy-to-parse report. Alternatively (more on this below) it could be a way of extracting information from the compiler (by way of a report or similar). In case it helps, the compiler will in all likelihood be the Keil ARM compiler.
I've been experimenting with ctags and this is very useful for finding all of the typedefs and macro definitions etc, but it doesn't provide a direct way to find the type of variables, especially when the definition is spread over multiple lines (which I hope it won't be!).
Examples might include:
static volatile u8 var1; // should be flagged as static and volatile and a u8 (typedef of unsigned 8-bit integer)
volatile /* comments */
static /* inserted just to make life */
u8 /* difficult! */ var2 =
(u8) 72
; // likewise (nasty syntax, but technically valid C)
const uint_16t *pointer1; // flagged as a pointer to a constant uint_16t
int * const pointer2; // flagged as a constant pointer to an int
const char * const pointer3; // flagged as a constant pointer to a constant char
static MyTypedefTYPE var3; // flagged as a MyTypedefTYPE variable
u8 var4, var5, var6 = 72;
int *array1[SOME_LENGTH]; // flagged as an array of pointers to integers
char array2[FIRST_DIM][72]; // flagged as an array of arrays of type char
etc etc etc
It will also need to identify whether they're local or global/file-scope variables (which ctags can do) and if they're local, I'd ideally like the name of the function that they're declared within.
Also, I'd like to do a similar thing with functions: identify the return type, whether they're static and the type and name of all of their arguments.
Unfortunately, this is rather difficult with C syntax: the specifiers and qualifiers in a declaration can appear in more than one order, and any amount of white space (or comments) is allowed between them. I've toyed with some fancy regular expressions, but there are so many cases to cover that they quickly become unmanageable. I can't help but think that compilers must be able to do this (in order to work!), so I was wondering whether it is possible to extract this information from one. The Keil compiler seems to produce a ".crf" file for each source file it compiles, and this appears to contain all of the variables declared in that file, but it's a binary format and I can't find any information on how to parse it. Alternatively, a way of getting the information out of ctags would be perfect.
Any help that anyone can offer with this would be greatly appreciated.
Thanks,
Al
There are a number of Python parser packages that let you describe a syntax and then generate Python code to parse it.
Ned Batchelder wrote a very nice summary of them.
Of those, Ply was used in a project called pycparser that parses C source code. I would recommend starting with this.
Some of those other parser projects might also have sample C parsers.
Edit: just noticed that pycparser even has a sample Python script to just parse C type declarations like the old cdecl program.
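As a rough illustration of the pycparser route, here is a minimal sketch that walks the parsed tree and reports each variable's name, enclosing function, storage class and type. The names describe, DeclReporter and report are made up for this example. It assumes the source has already been preprocessed, since pycparser expects comments and #directives to be gone and every typedef to be visible; struct, union and enum types are left unexpanded.

```python
from pycparser import c_ast, c_parser


def describe(node):
    """Render a pycparser type node as a short English phrase."""
    if isinstance(node, c_ast.TypeDecl):
        # Qualifiers written next to the base type, e.g. "volatile u8".
        return " ".join(node.quals + [describe(node.type)])
    if isinstance(node, c_ast.IdentifierType):
        return " ".join(node.names)
    if isinstance(node, c_ast.PtrDecl):
        # Qualifiers written after the '*', e.g. "int * const p".
        return " ".join(node.quals + ["pointer to", describe(node.type)])
    if isinstance(node, c_ast.ArrayDecl):
        return "array of " + describe(node.type)
    return type(node).__name__  # struct/union/enum: not expanded here


class DeclReporter(c_ast.NodeVisitor):
    """Collect (name, scope, storage, type) for every variable declaration."""

    def __init__(self):
        self.rows = []
        self.func = None

    def visit_FuncDef(self, node):
        self.func = node.decl.name
        self.generic_visit(node)  # descend into the body for locals
        self.func = None

    def visit_Decl(self, node):
        # Skip function prototypes and anonymous declarations.
        if node.name and not isinstance(node.type, c_ast.FuncDecl):
            scope = self.func or "<file scope>"
            self.rows.append(
                (node.name, scope, list(node.storage), describe(node.type))
            )


def report(source):
    """Parse preprocessed C source and return the collected rows."""
    ast = c_parser.CParser().parse(source)
    reporter = DeclReporter()
    reporter.visit(ast)
    return reporter.rows
```

Each row can then be checked against the naming convention; function parameters are deliberately skipped here and would need their own handling of the FuncDecl node.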
How about approaching it from the other side completely? You already have a parser that fully understands all of the nuances of the C type system: the compiler itself. So, compile the project with full debug support, and go spelunking in the debug data.
For a system based on formats supported by binutils, most of the detail you need can be learned with the BFD library.
Microsoft's debug formats are (somewhat) supported by libraries and documents at MSDN, but my Google-fu is weak today and I'm not putting my hands on the articles I know exist to link here.
The Keil 8051 compiler (I haven't used their ARM compiler here) uses Intel OMF or OMF2 format, and documents that the debug symbols are for their debugger or "any Intel-compatible emulators". Specs for OMF as used by Keil C51 are available from Keil, so I would imagine that similar specs are available for their other compilers too.
A quick scan of Keil's web site seems to indicate that they abandoned their proprietary ARM compiler in favor of licensing ARM's RealView Compiler, which appears to use ELF objects with DWARF format debug info. DWARF should be supported by BFD, and should give you everything you need to know to verify that the types and names match.
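To give a feel for what spelunking in DWARF data might look like from Python, here is a rough sketch that reads the text printed by GNU readelf rather than calling BFD directly. It assumes readelf is installed and can read the objects your toolchain produces, and that the dump uses the usual "<depth><offset>: Abbrev Number: N (DW_TAG_...)" layout; all function names are invented for the example. A dedicated DWARF library such as pyelftools would be more robust than scraping text.

```python
import re
import subprocess

# DIE header line, e.g. " <1><2d>: Abbrev Number: 2 (DW_TAG_variable)"
DIE_RE = re.compile(
    r"^\s*<(\d+)><([0-9a-f]+)>: Abbrev Number: \d+ \((DW_TAG_\w+)\)"
)
# Attribute line, e.g. "    <2e>   DW_AT_name        : var1"
ATTR_RE = re.compile(r"^\s*<[0-9a-f]+>\s+(DW_AT_\w+)\s*:\s*(.*)$")

# Type modifiers that wrap another type through DW_AT_type.
PREFIXES = {
    "DW_TAG_pointer_type": "pointer to ",
    "DW_TAG_const_type": "const ",
    "DW_TAG_volatile_type": "volatile ",
    "DW_TAG_array_type": "array of ",
}


def parse_dies(dump_text):
    """Map DIE offset -> {'depth', 'tag', 'attrs'} from the text dump."""
    dies, current = {}, None
    for line in dump_text.splitlines():
        header = DIE_RE.match(line)
        if header:
            current = {"depth": int(header.group(1)),
                       "tag": header.group(3), "attrs": {}}
            dies[int(header.group(2), 16)] = current
            continue
        attr = ATTR_RE.match(line)
        if attr and current is not None:
            current["attrs"][attr.group(1)] = attr.group(2).strip()
    return dies


def die_name(die):
    # Names may be printed as "(indirect string, offset: 0x1f): var1".
    name = die["attrs"].get("DW_AT_name", "<anonymous>")
    return name.rsplit("): ", 1)[-1]


def describe_type(dies, die):
    """Follow DW_AT_type references until a named type is reached."""
    ref = re.search(r"<0x([0-9a-f]+)>", die["attrs"].get("DW_AT_type", ""))
    if ref is None:
        return "void"
    target = dies.get(int(ref.group(1), 16))
    if target is None:
        return "<unresolved>"
    if target["tag"] in PREFIXES:
        return PREFIXES[target["tag"]] + describe_type(dies, target)
    return die_name(target)


def list_variables(dump_text):
    """Return (name, type, depth); depth 1 is file scope, deeper is local."""
    dies = parse_dies(dump_text)
    return [(die_name(d), describe_type(dies, d), d["depth"])
            for d in dies.values() if d["tag"] == "DW_TAG_variable"]


def dump_debug_info(object_path):
    """Run readelf on an object file and return its .debug_info dump."""
    result = subprocess.run(
        ["readelf", "--debug-dump=info", object_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like list_variables(dump_debug_info("main.o")) would then give one tuple per variable, with typedef names such as u8 preserved because the walk stops at the first named type.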