Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Build AST from C code

How can I build an AST (Abstract Syntax Tree) from gcc C code in order to make some modifications, like converting some int variables to float, and reproduce(generate) the code to C syntax again after that.

Actually, for the moment, the only functionality I truly need is to extract a table of variables and their types from a c program consisting of few lines... I think there is a simple parser doing so.

I have some variables like:

int  var_bss ;           
float var_f_bss;            
int var_data = 4;        
float var_f_data = 5;  

And a function:

int Foo(){          
   some local variables;            
}    

The code is in a single c file.

I want to introduce all the variables to the end user to let him choose a source type in a specific memory segment e.g. int variables in the .data. Then the user can convert those variables into floats. Finally, I generate the same code for the user but with the new variable types those he has chosen.

like image 328
Asan Avatar asked Nov 27 '13 18:11

Asan


3 Answers

First, it is a difficult task, because the abstract syntax tree of C is much more complex than what you believe it is. Read the C11 standard n1570 for details, and see this website. Look also into tinyCC or nwcc (at least for inspiration).

Then if you are using a recent GCC (e.g. 4.7 or 4.8), I strongly suggest customizing GCC e.g. with a MELT extension (or your GCC plugin).

I don't claim it is a simple task, because very probably you need to understand the details of GCC internal representations (at least GIMPLE)

BTW, MELT is (was) a domain specific language to extend GCC, and is designed exactly for the kind of tasks you are dreaming about. You would be able with MELT to transform the internal GCC representations (Gimple and Tree-s). Today in 2020, MELT is not worked upon because of lack of funding.

The advantage of working inside GCC (or inside some other compiler like Clang/LLVM) is that you don't have to spit back some C code (which is actually much more difficult than what you think); you just transform the internal compiler representation and, perhaps most importantly, you take advantage "gratis" of the many things a compiler always do: all kind of optimizations like constant folding, inlining, common-subexpression elimination, etc, etc, etc, ....

In 2020, you could also consider using the libgccjit framework inside recent GCC 10, and read this draft report (related to Bismon; but see also RefPerSys, sharing some ideas but no code with Bismon). Try perhaps also the Clang static analyzer and/or Frama-C.

like image 52
Basile Starynkevitch Avatar answered Oct 04 '22 05:10

Basile Starynkevitch


What you are asking for is a C source-to-source transformer. Such a tool is very difficult to build, partly because of the inherent complexity of C, and partly because of the C preprocessor: the AST may contain fragments from system headers etc. that you need to handle properly while unparsing (emitting C code again at the end).

You could give Robert Grimm's SuperC a try: https://cs.nyu.edu/rgrimm/xtc/ That particular parser is supposed to handle all of C (including the preprocessor bits). I don't know if it can handle unparsing, but that should be comparatively easy (read: still lots of work) to do.

like image 34
creichen Avatar answered Oct 03 '22 05:10

creichen


Eli Bendersky's pycparser is a C source-to-source tool written in Python: https://github.com/eliben/pycparser

It will parse C99 and can build a detailed Parse Tree with nodes matching the grammar in the K&R "The C Programming Language" Appendix A ch. 13 "Grammar". It's built on a Python pseudo-implementation of lex/yacc, flex/bison whatever called PLY.

It has examples and it's really easy to get going. Like the other posters said, it is a complex task to reduce the parse tree to a minimal AST with all the irrelevant details left out.

This project can do source-to-source transformations as well: https://github.com/axw/cmonster/ CMonster is written in Python and wraps the Clang API.

If you wanna use GCC for the task, you should look into MELT. There is another project where the scripting language is JavaScript, but I can't remember the name ATM..

EDIT: responding to comments

Yeah, the framework that handled the intermediate representation was called TreeHydra and it is abandoned, but still working as far as I can see. There is a video tutorial online somewhere with the young Ph.D. dude that designed TreeHydra - think I found it with google video - explaining his choice of JS as interface language because of the popularity etc. He appeared knowledgeable and charismatic and I guess that's the reason that particular project stuck with me :) Haven't tried it out myself though.

I myself am working on a hobby Control Flow Graph and Data Flow Analysis tool using Eli Bendersky's framework as a building block. Of the toolkits I've tried, Eli's kit really seems the most promising. Together with inspiration from this particular cool project: Atul's Mini-C Compiler that utilizes the same Lex/Yacc Python port (PLY). Haven't done much yet, but it was easier getting going than learning libclang, although I do consider that a very promising route as well.

like image 30
Morten Jensen Avatar answered Oct 04 '22 05:10

Morten Jensen