I am currently exploring designing a compiler that transforms its AST in multiple stages. The idea is that starting from the parse tree, each pass transforms the tree until the resulting AST is optimised and contains all of the required information in each node of the tree required to generate the intermediate code (in this case LLVM IR). A pass over the tree may considerably change its structure, for example changing a list of operators and operands into a hierarchy of ordered operations via operator precedence parsing. Note that a pass may leave parts of the structure entirely unchanged. So, my question is how do I best (read: most easily, with as little repetition as possible) represent an AST that has multiple intermediate representations in C++? I would like node types from each phase's version of the AST to respect their incompatibility at compile time. I believe that the key issue is how should I represent parts of the structure that do not change between passes while avoiding repetitive code? I imagine that this is a problem solved many times in the past by compiler authors. Note that I am currently using Boost Variant instead of normal runtime polymorphism in my AST, and would like a solution to be compatible with it as well.

Here is a quick stab at a type-safe <code>boost::variant</code> based AST. I included a simple "structure preserving transform" that simply changes the type of data stored in each AST node. In theory, however, you can write an arbitrary <code>astFunc</code> that both does a structural and data based transform of the nodes -- just write a <code>type_list</code> that contains the valid types in each node before and after. <pre class="prettyprint"><code>template<typename... Ts> struct type_list {}; // specialize data_type to store something special in your AST node: // (by default, an entry means "the type of the data") tempalte<typename T> struct data_type { typedef T type; }; template<typename T> using DataType = typename data_type<T>::type; template<template<typename>class F, typename typelist> struct map_types; template<template<typename>class F, template<typename...>L, typename... Ts> struct map_types<F, L<Ts...>> { typedef L< F<Ts>... > type; }; template<template<typename>class F, typename typelist> using MapTypes = typename map_types<F, typelist>::type; template<template<typename...>class F, typename typelist> struct apply_list; template<template<typename...>class F, template<typename...>class L, typename... Ts> struct apply_list<F, L<Ts...>> { typedef F<Ts...> type; }; template<template<typename...>class F, typename typelist> using ApplyList = typename apply_list<F, typelist>::type; template<typename typelist> using Var = ApplyList< boost::variant, MapTypes<DataType, typelist> >; template<typename type_list> struct AST_Node { typedef std::unique_ptr<AST_Node> upAST_Node; std::vector<upAST_Node> children; Var<type_list> data; template<typename T> AST_Node( T&& t ):data( std::forward<T>(t) ) {} }; template<typename type_list> using upAST_Node = typename AST_Node<type_list>::upAST_Node; template<typename before_types, typename after_types> using typeFunc = std::function< Var<after_types>(Var<before_types>) >; template<typename before_types, typename after_types> using astFunc = std::function< upAST_Node<after_types>(upAST_Node<before_types>) >; template<typename before_types, typename after_types> astFunc<before_types, after_types> elementWiseTransform( typeFunc<before_types, after_types> func ) { return [func]( upAST_Node<before_types> before )->upAST_Nodes<after_types> { upAST_Node<after_types> after( new AST_Node<after_types>( func( before ) ) ); after->children.reserve( before->children.size() ); for( auto& child: before->children ) { after->children.push_back( elementWiseTransform(func)(std::move(child)) ); } return after; }; } </code></pre> Now this is just a start. You could go further, and have each type of node have a different set of types of children, or even a different number. Simply create traits classes for each type of node like my <code>data_type</code>, such as <code>children_types</code>. Then use a similar technique to how I defined <code>Var</code> to define the type of your children. Basically, you have a <code>variant</code> of <code>std::vector< AST_Node<ChildType<type_list_element>>></code> via a chaining of <code>MapTypes</code>. Heck you could bundle the <code>std::vector</code> of children and the <code>data</code> together into one variant. This would allow you to write an mapping for an individual <code>AST_Node</code> type (which makes to another <code>AST_Node</code> type), aggregate them all together and generate a <code>AST_Node<before, after></code> functor that would then walk over the tree. Some of the functors would operate on the data only then let the parent logic take over for children, some would transform entire subtrees, some would operate on the data and stop the parent logic from running over the children. That technique gets tricky, because you have to synthesize boost variant visitors from your individual functions in a way that doesn't require piling them all together. If you look here you'll see a few techniques on how to take a bunch of <code>std::function<T(U)></code> and turn them into one functor that takes any one of the union of <code>U</code>. Toss in some work to calculate the union of the return types (a simple <code>type_list</code> with duplicate types removed, then pumped into a <code>boost::variant</code>, might do it) -- such a "merged functor" would be a valid visitor. And now you can write "remap an AST node of type operator_add" functors, and "remap an AST node of type operator_mult", and a few others, tie them together into a mega-functor, throw them at an AST traversal algorithm, and have it spew out an AST tree with some types converted to other types... But that would be a lot of work. Oh, and we might want "phase tagging", where a phase 1 and phase 2 AST are different types. We could tag each type in the <code>type_list</code> with its phase, or we could just tag the <code>AST</code> tree itself. Heck, we could name the phases for the <code>AST</code> using otherwise unused <code>struct</code>s, and define a progression through phases as a type to type functor that gets applied and enforced in the signature of <code>astFunc<before_phase, before_types, after_phase, after_types></code>. So that isn't bad. We create a <code>type_list</code> of node types. These types need not be the actual data stored. But it can be. We create a <code>data_type</code> traits class that maps each node type to the data stored. We create a <code>child_types</code> traits class that maps each node type to the type_list of the child ASTs. Each <code>AST_Node</code> stores a <code>variant<AST_Concrete_Node<Ts>...></code>. <code>AST_Concrete_Node</code> contains a <code>DataType<T> data;</code> and a <code>MapTypes< std::vector, MapTypes< AST_Node, ChildTypes<T> > > children;</code> (aka, <code>std::vector< AST_Node<ChildrenTypes...> ></code>, but you cannot say that directly easily). Next, <code>AST_Concrete_Node<T></code> transformation functions are joined together in a tricky bit of template metaprogramming into boost variant visitors. This step is really tricky, but I think doable. Extra work is done so that unmentioned types are skipped, so we don't have to constantly say "oh, and we don't want to transform node X", but rather have to say "if we hit node Y, don't transform its kids". At this point, I'm going to say I am blathering on -- not having done this before, the problems encountered in a concrete implementation of this mess of type gymnastics is going to overwhelm my ability to abstractly reason about it. But the idea is hopefully useful -- we have type-safe node-type transformations which we aggregate together and generate a type-safe tree transformation. The tree isn't merely an abstract tree of universal variants, but an tree where each node knows what the types allowed in its children are, which recursively know the same. We could even handle "this must have exactly 3 children, the first of which is an <code>int</code>, the second is a <code>Bob</code>, and the third is a <code>double</code>" if we went far enough down the rabbit hole.

Representing a multiple pass Abstract Syntax Tree (AST) in C++?

Tags:

c++

compiler-construction

abstract-syntax-tree

parse-tree

I am currently exploring designing a compiler that transforms its AST in multiple stages. The idea is that starting from the parse tree, each pass transforms the tree until the resulting AST is optimised and contains all of the required information in each node of the tree required to generate the intermediate code (in this case LLVM IR). A pass over the tree may considerably change its structure, for example changing a list of operators and operands into a hierarchy of ordered operations via operator precedence parsing. Note that a pass may leave parts of the structure entirely unchanged.

So, my question is how do I best (read: most easily, with as little repetition as possible) represent an AST that has multiple intermediate representations in C++? I would like node types from each phase's version of the AST to respect their incompatibility at compile time. I believe that the key issue is how should I represent parts of the structure that do not change between passes while avoiding repetitive code? I imagine that this is a problem solved many times in the past by compiler authors.

Note that I am currently using Boost Variant instead of normal runtime polymorphism in my AST, and would like a solution to be compatible with it as well.

343

asked Apr 16 '13 17:04

Dylan

2 Answers

AST nodes by themselves don't need huge amounts of complexity. I think all this AST node machinery is just overkill.

The problem with ASTs isn't node type safety; its tree shape safety. An AST represents (presumably) some valid instance of some language L. What you ideally want is for transformations on the AST to produce other valid ASTs (instances of the language L). You're not going to guarantee that by guaranteeing that any one node has a valid type; you can only do it by guaranteeing that any tree patch produces a valid tree. And this is very difficult to do if the tree operations are atomic (e.g., "change node that", "replace child", "replace parent") and applied seperately; after several such steps, what precisely can you say about the tree?

This is better done using a kind of tree-rewrite transaction, e.g., source-to-source transformations whose grammatical structure is valid for language L, and which are applied in places that are valid for that transformation.

Most standard program transformation systems do this. They achieve this by holding a model of grammar for L, and checking that the proposed transforms are well-typed. This ensures that transformations of language L to language L stay well-formed.

This is harder to get right if the transformations map from one language A to another language B; if some such transformations are applied, you usually get a tree with mixed types that is not legal in either language. With care, one can define a set of transforms that map all subtrees of language A to language B, and apply them exhaustively; then you want the resulting tree to be well formed for B. You can ensure that by insisting whenever a B-patch is inserted in a mixed tree, if it is adjacent to another B-patch, that the resulting compound B-patch is well formed. This you can do using the same style of grammar checks.

Using these ideas, you can build a system that maps an AST through a series of "representations" (langauges A, B, C, ....) and have some faith that the result tree is well-shaped. This idea generalizes to graph rewrites.

198

answered Oct 07 '22 00:10

Ira Baxter

Here is a quick stab at a type-safe boost::variant based AST.

I included a simple "structure preserving transform" that simply changes the type of data stored in each AST node. In theory, however, you can write an arbitrary astFunc that both does a structural and data based transform of the nodes -- just write a type_list that contains the valid types in each node before and after.

template<typename... Ts>
struct type_list {};

// specialize data_type to store something special in your AST node:
// (by default, an entry means "the type of the data")
tempalte<typename T>
struct data_type { typedef T type; };
template<typename T>
using DataType = typename data_type<T>::type;

template<template<typename>class F, typename typelist>
struct map_types;
template<template<typename>class F, template<typename...>L, typename... Ts>
struct map_types<F, L<Ts...>> {
  typedef L< F<Ts>... > type;
};
template<template<typename>class F, typename typelist>
using MapTypes = typename map_types<F, typelist>::type;

template<template<typename...>class F, typename typelist>
struct apply_list;
template<template<typename...>class F, template<typename...>class L, typename... Ts>
struct apply_list<F, L<Ts...>> {
  typedef F<Ts...> type;
};
template<template<typename...>class F, typename typelist>
using ApplyList = typename apply_list<F, typelist>::type;
template<typename typelist>
using Var = ApplyList< boost::variant, MapTypes<DataType, typelist> >;

template<typename type_list>
struct AST_Node {
  typedef std::unique_ptr<AST_Node> upAST_Node;
  std::vector<upAST_Node> children;
  Var<type_list> data;
  template<typename T>
  AST_Node( T&& t ):data( std::forward<T>(t) ) {}
};
template<typename type_list>
using upAST_Node = typename AST_Node<type_list>::upAST_Node;

template<typename before_types, typename after_types>
using typeFunc = std::function< Var<after_types>(Var<before_types>) >;

template<typename before_types, typename after_types>
using astFunc = std::function< upAST_Node<after_types>(upAST_Node<before_types>) >;

template<typename before_types, typename after_types>
astFunc<before_types, after_types> elementWiseTransform( typeFunc<before_types, after_types> func ) {
  return [func]( upAST_Node<before_types> before )->upAST_Nodes<after_types> {
    upAST_Node<after_types> after( new AST_Node<after_types>( func( before ) ) );
    after->children.reserve( before->children.size() );
    for( auto& child: before->children ) {
      after->children.push_back( elementWiseTransform(func)(std::move(child)) );
    }
    return after;
  };
}

Now this is just a start.

You could go further, and have each type of node have a different set of types of children, or even a different number. Simply create traits classes for each type of node like my data_type, such as children_types. Then use a similar technique to how I defined Var to define the type of your children. Basically, you have a variant of std::vector< AST_Node<ChildType<type_list_element>>> via a chaining of MapTypes. Heck you could bundle the std::vector of children and the data together into one variant.

This would allow you to write an mapping for an individual AST_Node type (which makes to another AST_Node type), aggregate them all together and generate a AST_Node<before, after> functor that would then walk over the tree. Some of the functors would operate on the data only then let the parent logic take over for children, some would transform entire subtrees, some would operate on the data and stop the parent logic from running over the children.

That technique gets tricky, because you have to synthesize boost variant visitors from your individual functions in a way that doesn't require piling them all together. If you look here you'll see a few techniques on how to take a bunch of std::function<T(U)> and turn them into one functor that takes any one of the union of U. Toss in some work to calculate the union of the return types (a simple type_list with duplicate types removed, then pumped into a boost::variant, might do it) -- such a "merged functor" would be a valid visitor.

And now you can write "remap an AST node of type operator_add" functors, and "remap an AST node of type operator_mult", and a few others, tie them together into a mega-functor, throw them at an AST traversal algorithm, and have it spew out an AST tree with some types converted to other types...

But that would be a lot of work.

Oh, and we might want "phase tagging", where a phase 1 and phase 2 AST are different types. We could tag each type in the type_list with its phase, or we could just tag the AST tree itself. Heck, we could name the phases for the AST using otherwise unused structs, and define a progression through phases as a type to type functor that gets applied and enforced in the signature of astFunc<before_phase, before_types, after_phase, after_types>.

So that isn't bad. We create a type_list of node types. These types need not be the actual data stored. But it can be.

We create a data_type traits class that maps each node type to the data stored. We create a child_types traits class that maps each node type to the type_list of the child ASTs.

Each AST_Node stores a variant<AST_Concrete_Node<Ts>...>. AST_Concrete_Node contains a DataType<T> data; and a MapTypes< std::vector, MapTypes< AST_Node, ChildTypes<T> > > children; (aka, std::vector< AST_Node<ChildrenTypes...> >, but you cannot say that directly easily).

Next, AST_Concrete_Node<T> transformation functions are joined together in a tricky bit of template metaprogramming into boost variant visitors. This step is really tricky, but I think doable. Extra work is done so that unmentioned types are skipped, so we don't have to constantly say "oh, and we don't want to transform node X", but rather have to say "if we hit node Y, don't transform its kids".

At this point, I'm going to say I am blathering on -- not having done this before, the problems encountered in a concrete implementation of this mess of type gymnastics is going to overwhelm my ability to abstractly reason about it. But the idea is hopefully useful -- we have type-safe node-type transformations which we aggregate together and generate a type-safe tree transformation. The tree isn't merely an abstract tree of universal variants, but an tree where each node knows what the types allowed in its children are, which recursively know the same. We could even handle "this must have exactly 3 children, the first of which is an int, the second is a Bob, and the third is a double" if we went far enough down the rabbit hole.

answered Oct 07 '22 02:10

Yakk - Adam Nevraumont

Related questions
                            
                                Bug throwing exceptions with std::call_once
                            
                                Explain integer comparison with promotion
                            
                                Fast method to multiply integer by proper fraction without floats or overflow
                            
                                How can a const expr be evaluated so fast
                            
                                Should volatile still be used for sharing data with ISRs in modern C++?
                            
                                Is there an auto-update framework for C++/Win32/MFC (like Sparkle)? [closed]
                            
                                Dynamic slicing in C/C++
                            
                                C++ Mock/Test boost::asio::io_stream - based Asynch Handler
                            
                                c++ g++ llvm-clang compiler profiling
                            
                                How does linker know which symbols should be resolved at runtime?
                            
                                Automake Variables to tidy up Makefile.am
                            
                                Real world example of dynamic_cast in C++
                            
                                Sequence iterator? Isn't there one in boost?
                            
                                replace c++ with go + swig
                            
                                Smart pointers and dynamic_cast
                            
                                printing very large floating point numbers
                            
                                Length prefix for protobuf messages in C++
                            
                                Is it well-defined behaviour to exit the program before main?
                            
                                How can I catch exceptions thrown from constructor of a global instance of a class?
                            
                                C++ std::set why is it associative?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With