
Representing an Abstract Syntax Tree in C

I'm implementing a compiler for a simple toy language in C. I have a working scanner and parser, and a reasonable background on the conceptual function/construction of an AST. My question is related to the specific way to represent an AST in C. I've come across three styles pretty frequently in different texts/resources online:

One struct per type of node.

This has a base node "class" (a struct) that is the first field in all the child structs. The base node contains an enum that stores the type of node (constant, binary operator, assignment, etc.). Members of each struct are accessed using a set of macros, with one set per struct. It looks something like this:

    struct ast_node_base {
        enum {CONSTANT, ADD, SUB, ASSIGNMENT} class;
    };

    struct ast_node_constant {
        struct ast_node_base base;   /* embedded by value so it really is the first field */
        int value;
    };

    struct ast_node_add {
        struct ast_node_base base;
        struct ast_node_base *left;
        struct ast_node_base *right;
    };

    struct ast_node_assign {
        struct ast_node_base base;
        struct ast_node_base *left;
        struct ast_node_base *right;
    };

    /* No trailing semicolons, so the macros can be used inside expressions. */
    #define CLASS(node) (((struct ast_node_base *)(node))->class)

    #define ADD_LEFT(node)  (((struct ast_node_add *)(node))->left)
    #define ADD_RIGHT(node) (((struct ast_node_add *)(node))->right)

    #define ASSIGN_LEFT(node)  (((struct ast_node_assign *)(node))->left)
    #define ASSIGN_RIGHT(node) (((struct ast_node_assign *)(node))->right)

One struct per layout of node.

This appears to be mostly the same as the above layout, except that instead of having ast_node_add and ast_node_assign it would have a single ast_node_binary to represent both, because the layout of the two structs is the same and they differ only in the contents of the base's class field. The advantage seems to be a more uniform set of macros (LEFT(node) for all nodes with a left and right, instead of one pair of macros per struct), but the disadvantage seems to be that C type checking won't be as useful (there would be no way to detect an ast_node_assign where there should only be an ast_node_add, for example).
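As a rough sketch of this second style, assuming the base struct is embedded by value as the first member: the helper names `is_binary`, `as_binary`, and `make_binary` are hypothetical and not from any of the resources mentioned. The run-time assert on the tag is one way to recover part of the checking the compiler can no longer do.

```c
#include <assert.h>
#include <stdlib.h>

enum ast_class { CONSTANT, ADD, SUB, ASSIGNMENT };

/* Common header, embedded by value as the first member of every node. */
struct ast_node_base {
    enum ast_class class;
};

/* One struct serves every node that has a left and a right child. */
struct ast_node_binary {
    struct ast_node_base base;
    struct ast_node_base *left;
    struct ast_node_base *right;
};

/* Hypothetical helper: true for every tag that uses the binary layout. */
static int is_binary(const struct ast_node_base *n) {
    return n->class == ADD || n->class == SUB || n->class == ASSIGNMENT;
}

/* Hypothetical helper: checked downcast from the base to the binary layout. */
static struct ast_node_binary *as_binary(struct ast_node_base *n) {
    assert(n != NULL && is_binary(n));
    return (struct ast_node_binary *)n;
}

/* One pair of macros for all binary nodes. */
#define LEFT(node)  (as_binary(node)->left)
#define RIGHT(node) (as_binary(node)->right)

/* Hypothetical constructor: returns NULL if allocation fails. */
struct ast_node_base *make_binary(enum ast_class class,
                                  struct ast_node_base *left,
                                  struct ast_node_base *right) {
    struct ast_node_binary *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->base.class = class;
    n->left = left;
    n->right = right;
    return &n->base;
}
```

Because the base is the first member, a pointer to the whole node and a pointer to its base refer to the same address, which is what makes the cast in `as_binary` valid.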

One struct total, with a union to hold different types of node data.

A better explanation of this than I can give can be found here. Using the types from the previous example it would look like:

    struct ast_node {
        enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
        union {
            int value;
            struct {
                struct ast_node *left;
                struct ast_node *right;
            } op;
        };  /* anonymous union (C11) */
    };

I'm inclined to like the third option the most because it makes recursive traversal much easier (lots of pointer casting is avoided in favor of the union), but it also doesn't take advantage of C type checking. The first option seems the most dangerous in that it relies on casting struct pointers to access the members of any node (even different members of the same node require different casts, e.g. base vs. left), but these casts are type checked, so that might be moot. The second option seems to me like the worst of both worlds, although maybe I'm missing something.
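To illustrate the traversal point, here is a minimal sketch of a recursive evaluator over the union layout. The `eval` function is hypothetical, and ASSIGNMENT is left out of the enum here because evaluating it would need a symbol table, which is outside this sketch. The anonymous union assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>

struct ast_node {
    enum { CONSTANT, ADD, SUB } class;
    union {
        int value;                      /* used when class == CONSTANT */
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } op;                           /* used when class is ADD or SUB */
    };
};

/* Hypothetical evaluator: no casts, the tag selects the union member. */
int eval(const struct ast_node *n) {
    assert(n != NULL);
    switch (n->class) {
    case CONSTANT:
        return n->value;
    case ADD:
        return eval(n->op.left) + eval(n->op.right);
    case SUB:
        return eval(n->op.left) - eval(n->op.right);
    }
    assert(!"unknown node class");
    return 0;
}
```

Nothing stops code from reading `op` on a CONSTANT node, so the switch on `class` is the only guard; that is the type-checking trade-off described above.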

Which of these three schemes is the best, and why? Is there a better fourth option I haven't come across yet? I'm assuming none of them is a "one size fits all" solution, so if it matters, the language I'm implementing is a statically typed imperative language, almost a small subset of C.

I also have a specific question about the third (union) layout: if I use only the value field, will there be empty space following the value to accommodate the possibility of op being written to?
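For what it's worth, the size question can be checked directly. A union is at least as large as its largest member, so a node that only uses value still occupies the room op would need. The sketch below is one way to confirm this on a given compiler; `print_layout` is a hypothetical helper, and the exact numbers it prints depend on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct ast_node {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
    union {
        int value;
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } op;
    };
};

/* Both union members start at the same offset inside the node. */
_Static_assert(offsetof(struct ast_node, value) == offsetof(struct ast_node, op),
               "union members share storage");

/* The node always reserves room for the two pointers in op. */
_Static_assert(sizeof(struct ast_node) > 2 * sizeof(struct ast_node *),
               "node is sized for its largest member");

/* Hypothetical helper: prints the sizes so they can be compared by eye. */
void print_layout(void) {
    printf("sizeof(int)             = %zu\n", sizeof(int));
    printf("sizeof(two pointers)    = %zu\n", 2 * sizeof(struct ast_node *));
    printf("sizeof(struct ast_node) = %zu\n", sizeof(struct ast_node));
}
```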

asked Jan 15 '14 23:01 by user1547129



1 Answer

You can make any of these work.

I prefer the union layout, because then all nodes have "the same" layout.

[You may find it useful to have a "child sublist" option, e.g., an arbitrarily big, dynamic array of children, instead of having left- or right-leaning lists.]
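A minimal sketch of what such a child sublist might look like, assuming a single node struct with a growable array of child pointers. All names here (`node_new`, `node_add_child`, `node_free`, and the field names) are made up for illustration, and the meaning of the class values is left to the compiler being written.

```c
#include <assert.h>
#include <stdlib.h>

struct ast_node {
    int class;                   /* node kind; values are up to the compiler */
    int value;                   /* payload for leaf nodes such as constants */
    size_t nchildren;            /* number of children in use */
    size_t capacity;             /* number of slots allocated */
    struct ast_node **children;  /* dynamic array of child pointers */
};

/* Allocates a node with no children; returns NULL if allocation fails. */
struct ast_node *node_new(int class, int value) {
    struct ast_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->class = class;
    n->value = value;
    n->nchildren = 0;
    n->capacity = 0;
    n->children = NULL;
    return n;
}

/* Appends child to parent, doubling the array when it is full.
   Returns 0 on success, -1 if the array could not be grown. */
int node_add_child(struct ast_node *parent, struct ast_node *child) {
    assert(parent != NULL && child != NULL);
    if (parent->nchildren == parent->capacity) {
        size_t cap = parent->capacity ? parent->capacity * 2 : 4;
        struct ast_node **grown = realloc(parent->children, cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        parent->children = grown;
        parent->capacity = cap;
    }
    parent->children[parent->nchildren++] = child;
    return 0;
}

/* Frees a node together with all of its descendants. */
void node_free(struct ast_node *n) {
    if (n == NULL)
        return;
    for (size_t i = 0; i < n->nchildren; i++)
        node_free(n->children[i]);
    free(n->children);
    free(n);
}
```

With this shape, a statement block or an argument list is one node with N children rather than a chain of N binary list nodes.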

You are going to find that this issue isn't the one that makes building your compiler hard. Rather, it is having symbol tables, performing various kinds of analyses, choosing a machine-level IR, building a code generator, and doing code optimizations. Then you're going to encounter real users and you'll discover what you really did wrong :-}

I'd pick one and run with it, so that you have a chance to get near the other issues.

answered Oct 03 '22 21:10 by Ira Baxter