Representing an Abstract Syntax Tree in C

I'm implementing a compiler for a simple toy language in C. I have a working scanner and parser, and a reasonable background on the conceptual function/construction of an AST. My question is related to the specific way to represent an AST in C. I've come across three styles pretty frequently in different texts/resources online:

One struct per type of node.

This has a base node "class" (struct) that is the first field in all the child structs. The base node contains an enum that stores the type of node (constant, binary operator, assignment, etc.). Members of the struct are accessed using a set of macros, with one set per struct. It looks something like this:

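    /* Common header. It is embedded by value as the FIRST member of every
       node struct, so a pointer to any node can be cast to
       struct ast_node_base * and back. */
    struct ast_node_base {
        enum {CONSTANT, ADD, SUB, ASSIGNMENT} class;
    };

    struct ast_node_constant {
        struct ast_node_base base;
        int value;
    };

    struct ast_node_add {
        struct ast_node_base base;
        struct ast_node_base *left;
        struct ast_node_base *right;
    };

    struct ast_node_assign {
        struct ast_node_base base;
        struct ast_node_base *left;
        struct ast_node_base *right;
    };

    /* Accessors: fully parenthesized, no trailing semicolon, so they can be
       used inside expressions and as assignment targets. */
    #define CLASS(node) (((struct ast_node_base *)(node))->class)

    #define ADD_LEFT(node)  (((struct ast_node_add *)(node))->left)
    #define ADD_RIGHT(node) (((struct ast_node_add *)(node))->right)

    #define ASSIGN_LEFT(node)  (((struct ast_node_assign *)(node))->left)
    #define ASSIGN_RIGHT(node) (((struct ast_node_assign *)(node))->right)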

One struct per layout of node.

This appears to be mostly the same as the above layout, except that instead of having ast_node_add and ast_node_assign it would have a single ast_node_binary to represent both, because the two structs have the same layout and differ only in the class stored in their base node. The advantage seems to be a more uniform set of macros (LEFT(node) for all nodes with a left and right, instead of one pair of macros per struct). The disadvantage is that C type checking becomes less useful: there would be no way to detect an ast_node_assign where only an ast_node_add should appear, for example.
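A minimal sketch of this per-layout variant, assuming the base struct is embedded by value as the first member (which is what makes the pointer casts valid). The names `ast_node_binary`, `LEFT`, and `RIGHT` follow the description above; the constructors `new_constant` and `new_binary` are hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct ast_node_base {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
};

struct ast_node_constant {
    struct ast_node_base base;   /* must stay the first member */
    int value;
};

/* One struct shared by every node kind that has two children. */
struct ast_node_binary {
    struct ast_node_base base;   /* must stay the first member */
    struct ast_node_base *left;
    struct ast_node_base *right;
};

#define CLASS(node) (((struct ast_node_base *)(node))->class)
#define LEFT(node)  (((struct ast_node_binary *)(node))->left)
#define RIGHT(node) (((struct ast_node_binary *)(node))->right)

/* Hypothetical constructor: returns NULL if allocation fails. */
struct ast_node_base *new_constant(int value)
{
    struct ast_node_constant *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->base.class = CONSTANT;
    n->value = value;
    return &n->base;
}

/* Hypothetical constructor for ADD, SUB and ASSIGNMENT nodes alike.
   The compiler cannot tell these kinds apart; only the tag can. */
struct ast_node_base *new_binary(int kind,
                                 struct ast_node_base *left,
                                 struct ast_node_base *right)
{
    struct ast_node_binary *n;

    assert(kind == ADD || kind == SUB || kind == ASSIGNMENT);
    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->base.class = kind;
    n->left = left;
    n->right = right;
    return &n->base;
}
```

Note that `LEFT` and `RIGHT` will happily cast a constant node too, so callers are expected to check `CLASS(node)` first.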

One struct total, with a union to hold different types of node data.

A better explanation of this than I can give can be found here. Using the types from the previous example it would look like:

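    struct ast_node {
        enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
        union {
            int value;                    /* used when class == CONSTANT */
            struct {
                struct ast_node *left;
                struct ast_node *right;
            } op;                         /* used by ADD, SUB, ASSIGNMENT */
        };  /* anonymous union (C11); give it a name for older compilers */
    };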

I'm inclined to like the third option the most because it makes recursive traversal much easier (lots of pointer casting is avoided in favor of the union), but it also doesn't take advantage of C type checking. The first option seems the most dangerous in that it relies on pointers to structs being cast to access the members of any node (even different members of the same node require different casts to access, e.g. base vs. left), but these casts are type checked so that might be moot. The second option seems to me like the worst of both worlds, although maybe I'm missing something.
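To make the traversal point concrete, here is a minimal sketch of a recursive walk over the single-struct (union) layout. The function `eval` is a hypothetical example that folds constant expressions; it assumes a C11 compiler for the anonymous union and deliberately does not handle ASSIGNMENT.

```c
#include <assert.h>
#include <stddef.h>

struct ast_node {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
    union {
        int value;                    /* valid when class == CONSTANT */
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } op;                         /* valid for binary node kinds */
    };
};

/* Evaluates a tree made of CONSTANT, ADD and SUB nodes.
   No casts are needed: the tag selects which union member to read. */
int eval(const struct ast_node *n)
{
    assert(n != NULL);
    switch (n->class) {
    case CONSTANT:
        return n->value;
    case ADD:
        return eval(n->op.left) + eval(n->op.right);
    case SUB:
        return eval(n->op.left) - eval(n->op.right);
    default:
        /* ASSIGNMENT needs a symbol table, which this sketch omits. */
        assert(!"eval: unsupported node class");
        return 0;
    }
}
```

The flip side is visible here as well: nothing stops code from reading `n->op.left` on a CONSTANT node, so correctness depends on always checking the tag first.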

Which of these three schemes is the best, and why? Is there a better fourth option I haven't come across yet? I'm assuming none of them is a "one size fits all" solution, so, if it matters, the language I'm implementing is a statically typed imperative language, almost a small subset of C.

I also have a specific question about the third (union) layout: if I use only the value field, will there be empty space following the value to accommodate the possibility of op being written to?
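For reference, a union is always at least as large as its largest member, so a node that only ever stores value still occupies the room op would need. A small sketch that checks this at compile time, assuming a C11 compiler; `struct op_payload` is a hypothetical stand-in used only to measure the two-pointer member.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

struct ast_node {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
    union {
        int value;
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } op;
    };
};

/* Hypothetical helper type with the same members as `op`. */
struct op_payload {
    struct ast_node *left;
    struct ast_node *right;
};

/* Every node, constant or not, reserves room for the largest member. */
static_assert(sizeof(struct ast_node) >= sizeof(struct op_payload),
              "a node must be able to hold the op member");
```

The exact number of unused bytes in a constant node depends on the platform's pointer size, int size, and alignment rules, so it is best measured with sizeof rather than assumed.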

asked Jan 15 '14 23:01

user1547129

1 Answer

You can make any of these work.

I prefer the union layout, because then all nodes have "the same" layout.

[You may find it useful to have a "child sublist" option, e.g., an arbitrarily big, dynamic array of children, instead of having left- or right-leaning lists.]
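One way the child-sublist idea might look, as a sketch under stated assumptions: every name here (`ast_new`, `ast_add_child`, `ast_free`, the `BLOCK` kind, and the field names) is hypothetical, and the growth policy (start at 4, then double) is just a common choice.

```c
#include <assert.h>
#include <stdlib.h>

struct ast_node {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT, BLOCK } class; /* BLOCK is hypothetical */
    int value;                   /* meaningful for CONSTANT only */
    size_t nchildren;            /* children in use */
    size_t capacity;             /* children allocated */
    struct ast_node **children;  /* dynamic array, NULL while empty */
};

/* Allocates a childless node; returns NULL if allocation fails. */
struct ast_node *ast_new(int kind, int value)
{
    struct ast_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->class = kind;
    n->value = value;
    n->nchildren = 0;
    n->capacity = 0;
    n->children = NULL;
    return n;
}

/* Appends child to parent, growing the array as needed.
   Returns 0 on success, -1 if the array could not be grown. */
int ast_add_child(struct ast_node *parent, struct ast_node *child)
{
    assert(parent != NULL && child != NULL);
    if (parent->nchildren == parent->capacity) {
        size_t cap = parent->capacity ? parent->capacity * 2 : 4;
        struct ast_node **grown =
            realloc(parent->children, cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        parent->children = grown;
        parent->capacity = cap;
    }
    parent->children[parent->nchildren++] = child;
    return 0;
}

/* Frees a node and, recursively, all of its children. */
void ast_free(struct ast_node *n)
{
    size_t i;

    if (n == NULL)
        return;
    for (i = 0; i < n->nchildren; i++)
        ast_free(n->children[i]);
    free(n->children);
    free(n);
}
```

With this shape a statement block or an argument list is a single node with N children, rather than a chain of binary "list" nodes.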

You are going to find that this issue isn't the one that makes building your compiler hard. Rather, it is having symbol tables, performing various kinds of analyses, choosing a machine-level IR, building a code generator, and doing code optimizations. Then you're going to encounter real users and you'll discover what you really did wrong :-}

I'd pick one and run with it, so that you have a chance to get near the other issues.

answered Oct 03 '22 21:10

Ira Baxter

Tags: c, struct, tree, compiler-construction, abstract-syntax-tree