Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Representing a multiple pass Abstract Syntax Tree (AST) in C++?

I am currently exploring designing a compiler that transforms its AST in multiple stages. The idea is that starting from the parse tree, each pass transforms the tree until the resulting AST is optimised and contains all of the required information in each node of the tree required to generate the intermediate code (in this case LLVM IR). A pass over the tree may considerably change its structure, for example changing a list of operators and operands into a hierarchy of ordered operations via operator precedence parsing. Note that a pass may leave parts of the structure entirely unchanged.

So, my question is how do I best (read: most easily, with as little repetition as possible) represent an AST that has multiple intermediate representations in C++? I would like node types from each phase's version of the AST to respect their incompatibility at compile time. I believe that the key issue is how should I represent parts of the structure that do not change between passes while avoiding repetitive code? I imagine that this is a problem solved many times in the past by compiler authors.

Note that I am currently using Boost Variant instead of normal runtime polymorphism in my AST, and would like a solution to be compatible with it as well.

like image 343
Dylan Avatar asked Apr 16 '13 17:04

Dylan


People also ask

What is AST in C?

An abstract syntax tree (or an AST) is a tree-shaped representation of source code that is convenient for a compiler to operate. A compiler might represent an expression like 4 + 2 * 10 + 3 * (5 + 1) using a tree structure like this: Figure 1. Abstract syntax tree example.

What is AST in syntax?

AST stands for Abstract Syntax Tree and they power a lot of parts of your development flow. Some people might have heard about them in the context of compilers but they are being used in a variety of tools. Even if you don't write general development tools, ASTs can be a useful thing to have in your toolbelt.

Is a parse tree an AST?

The abstract syntax tree (AST) resembles the parse tree for the input program. It includes the important syntactic structure of the program while omitting any nonterminals that are not needed to understand that structure.

How do you use AST?

How to do using ast library, a = b + 3 or a = 3+b , both have same node type i.e. BinOp, you can validate variable “a” value and its node type. For each line of code, create AST node then compare value, node type and other parameters as well like operator, operand, function name, class name, index, etc… if required.


2 Answers

AST nodes by themselves don't need huge amounts of complexity. I think all this AST node machinery is just overkill.

The problem with ASTs isn't node type safety; its tree shape safety. An AST represents (presumably) some valid instance of some language L. What you ideally want is for transformations on the AST to produce other valid ASTs (instances of the language L). You're not going to guarantee that by guaranteeing that any one node has a valid type; you can only do it by guaranteeing that any tree patch produces a valid tree. And this is very difficult to do if the tree operations are atomic (e.g., "change node that", "replace child", "replace parent") and applied seperately; after several such steps, what precisely can you say about the tree?

This is better done using a kind of tree-rewrite transaction, e.g., source-to-source transformations whose grammatical structure is valid for language L, and which are applied in places that are valid for that transformation.

Most standard program transformation systems do this. They achieve this by holding a model of grammar for L, and checking that the proposed transforms are well-typed. This ensures that transformations of language L to language L stay well-formed.

This is harder to get right if the transformations map from one language A to another language B; if some such transformations are applied, you usually get a tree with mixed types that is not legal in either language. With care, one can define a set of transforms that map all subtrees of language A to language B, and apply them exhaustively; then you want the resulting tree to be well formed for B. You can ensure that by insisting whenever a B-patch is inserted in a mixed tree, if it is adjacent to another B-patch, that the resulting compound B-patch is well formed. This you can do using the same style of grammar checks.

Using these ideas, you can build a system that maps an AST through a series of "representations" (langauges A, B, C, ....) and have some faith that the result tree is well-shaped. This idea generalizes to graph rewrites.

like image 198
Ira Baxter Avatar answered Oct 07 '22 00:10

Ira Baxter


Here is a quick stab at a type-safe boost::variant based AST.

I included a simple "structure preserving transform" that simply changes the type of data stored in each AST node. In theory, however, you can write an arbitrary astFunc that both does a structural and data based transform of the nodes -- just write a type_list that contains the valid types in each node before and after.

template<typename... Ts>
struct type_list {};

// specialize data_type to store something special in your AST node:
// (by default, an entry means "the type of the data")
tempalte<typename T>
struct data_type { typedef T type; };
template<typename T>
using DataType = typename data_type<T>::type;

template<template<typename>class F, typename typelist>
struct map_types;
template<template<typename>class F, template<typename...>L, typename... Ts>
struct map_types<F, L<Ts...>> {
  typedef L< F<Ts>... > type;
};
template<template<typename>class F, typename typelist>
using MapTypes = typename map_types<F, typelist>::type;

template<template<typename...>class F, typename typelist>
struct apply_list;
template<template<typename...>class F, template<typename...>class L, typename... Ts>
struct apply_list<F, L<Ts...>> {
  typedef F<Ts...> type;
};
template<template<typename...>class F, typename typelist>
using ApplyList = typename apply_list<F, typelist>::type;
template<typename typelist>
using Var = ApplyList< boost::variant, MapTypes<DataType, typelist> >;

template<typename type_list>
struct AST_Node {
  typedef std::unique_ptr<AST_Node> upAST_Node;
  std::vector<upAST_Node> children;
  Var<type_list> data;
  template<typename T>
  AST_Node( T&& t ):data( std::forward<T>(t) ) {}
};
template<typename type_list>
using upAST_Node = typename AST_Node<type_list>::upAST_Node;

template<typename before_types, typename after_types>
using typeFunc = std::function< Var<after_types>(Var<before_types>) >;

template<typename before_types, typename after_types>
using astFunc = std::function< upAST_Node<after_types>(upAST_Node<before_types>) >;

template<typename before_types, typename after_types>
astFunc<before_types, after_types> elementWiseTransform( typeFunc<before_types, after_types> func ) {
  return [func]( upAST_Node<before_types> before )->upAST_Nodes<after_types> {
    upAST_Node<after_types> after( new AST_Node<after_types>( func( before ) ) );
    after->children.reserve( before->children.size() );
    for( auto& child: before->children ) {
      after->children.push_back( elementWiseTransform(func)(std::move(child)) );
    }
    return after;
  };
}

Now this is just a start.

You could go further, and have each type of node have a different set of types of children, or even a different number. Simply create traits classes for each type of node like my data_type, such as children_types. Then use a similar technique to how I defined Var to define the type of your children. Basically, you have a variant of std::vector< AST_Node<ChildType<type_list_element>>> via a chaining of MapTypes. Heck you could bundle the std::vector of children and the data together into one variant.

This would allow you to write an mapping for an individual AST_Node type (which makes to another AST_Node type), aggregate them all together and generate a AST_Node<before, after> functor that would then walk over the tree. Some of the functors would operate on the data only then let the parent logic take over for children, some would transform entire subtrees, some would operate on the data and stop the parent logic from running over the children.

That technique gets tricky, because you have to synthesize boost variant visitors from your individual functions in a way that doesn't require piling them all together. If you look here you'll see a few techniques on how to take a bunch of std::function<T(U)> and turn them into one functor that takes any one of the union of U. Toss in some work to calculate the union of the return types (a simple type_list with duplicate types removed, then pumped into a boost::variant, might do it) -- such a "merged functor" would be a valid visitor.

And now you can write "remap an AST node of type operator_add" functors, and "remap an AST node of type operator_mult", and a few others, tie them together into a mega-functor, throw them at an AST traversal algorithm, and have it spew out an AST tree with some types converted to other types...

But that would be a lot of work.

Oh, and we might want "phase tagging", where a phase 1 and phase 2 AST are different types. We could tag each type in the type_list with its phase, or we could just tag the AST tree itself. Heck, we could name the phases for the AST using otherwise unused structs, and define a progression through phases as a type to type functor that gets applied and enforced in the signature of astFunc<before_phase, before_types, after_phase, after_types>.

So that isn't bad. We create a type_list of node types. These types need not be the actual data stored. But it can be.

We create a data_type traits class that maps each node type to the data stored. We create a child_types traits class that maps each node type to the type_list of the child ASTs.

Each AST_Node stores a variant<AST_Concrete_Node<Ts>...>. AST_Concrete_Node contains a DataType<T> data; and a MapTypes< std::vector, MapTypes< AST_Node, ChildTypes<T> > > children; (aka, std::vector< AST_Node<ChildrenTypes...> >, but you cannot say that directly easily).

Next, AST_Concrete_Node<T> transformation functions are joined together in a tricky bit of template metaprogramming into boost variant visitors. This step is really tricky, but I think doable. Extra work is done so that unmentioned types are skipped, so we don't have to constantly say "oh, and we don't want to transform node X", but rather have to say "if we hit node Y, don't transform its kids".

At this point, I'm going to say I am blathering on -- not having done this before, the problems encountered in a concrete implementation of this mess of type gymnastics is going to overwhelm my ability to abstractly reason about it. But the idea is hopefully useful -- we have type-safe node-type transformations which we aggregate together and generate a type-safe tree transformation. The tree isn't merely an abstract tree of universal variants, but an tree where each node knows what the types allowed in its children are, which recursively know the same. We could even handle "this must have exactly 3 children, the first of which is an int, the second is a Bob, and the third is a double" if we went far enough down the rabbit hole.

like image 38
Yakk - Adam Nevraumont Avatar answered Oct 07 '22 02:10

Yakk - Adam Nevraumont