Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AST traversal in visitor or in the nodes?

Update accepted Ira Baxter's answer since it pointed me into the right direction: I first figured out what I actually needed by starting the implementation of the compiling stage, and it became obvious pretty soon that traversal within the nodes made thie an impossible approach. Not all nodes should be visited, and some of them in reverse order (for example, first the rhs of an assignment so the compiler can check if the type matches with the rhs/operator). Putting traversal in the visitor makes this all very easy.

I'm playing around with ASTs and the likes before deciding a major refactory of the handling of a mini-language used in an applicaiton. I've built a Lexer/Parser and can get the AST just fine. There's also a Visitor and as concrete implementation I made an ASTToOriginal which just recreates the original source file. Eventually there's goin to be some sort of compiler that also implements the Vsisitor and creates the actual C++ code at runtime so I want to make sure everything is right from the start. While everything works fine now, there is some similar/duplicate code since the traversal order is implemented in the Visitor itself.

When looking up more information, it seems that some implementations prefer keeping the traversal order in the visited objects themselves instead, in order not to repeat this in each concrete visitor. Even the GoF only talks briefly about this, in the same way. So I wanted to give this approach a try as well but got stuck pretty soon.. Let me explain.

Sample source line and corresponding AST nodes:

if(t>100?x=1;sety(20,true):x=2)
Conditional
  BinaryOp
    left=Variable [name=t], operator=[>], right=Integer [value=100]
  IfTrue
    Assignment
      left=Variable [name=x], operator=[=], right=Integer [value=1] 
    Method
      MethodName [name=sety], Arguments( Integer [value=20], Boolean [value=true] )
  IfFalse
    Assignment
      left=Variable [name=x], operator=[=], right=Integer [value=1]

Some code:

class BinaryOp {
  void Accept( Visitor* v ){ v->Visit( this ); }
  Expr* left;
  Op* op;
  Expr* right;
};    
class Variable {
  void Accept( Visitor* v ){ v->Visit( this ); }
  Name* name;
};
class Visitor { //provide basic traversal, terminal visitors are abstract
  void Visit( Variable* ) = 0;
  void Visit( BinaryOp* p ) {
    p->left->Accept( this );
    p->op->Accept( this );
    p->right->Accept( this );        
  }
  void Visit( Conditional* p ) {
    p->cond->Accept( this );
    VisitList( p->ifTrue ); //VisitList just iterates over the array, calling Accept on each element
    VisitList( p->ifFalse );
  }
};

Implementing ASTToOriginal is pretty straightforward: all abstract Visitor methods just print out the name or value member of the terminal. For the non-terminals it depends; printing an Assignment works ok with the default Visitor traversal, for a Conditional extra code is needed:

class ASTToOriginal {
  void Visit( Conditional* p ) {
    str << "if(";
    p->cond->Accept( this );
    str << "?";
      //VisitListWithPostOp is like VisitList but calls op for each *except the last* iteration
    VisitListWithPostOp( p->ifTrue, AppendText( str, ";" ) );
    VisitListWithPostOp( p->ifFalse, AppendText( str, ";" ) );
    str << ")";
  }
};

So as one can see both the Visit methods for a Conditional in Visitor and ASTToOriginal are indeed very similar. However trying to solve this by putting traversal into the nodes made things not just worse, but rather a complete mess. I tried an approach with PreVisit and PostVisit methods which solved some problems, but just introduced more and more code into the Nodes. It also started to look like I would have to keep track of a number of states inside the Visitor to be able to know when to add closing brackets etc.

class BinaryOp {
  void Accept( Conditional* v ) { 
    v->Visit( this );
    op->Accept( v )
    VisitList( ifTrue, v );
    VisitList( ifFalse, v );
};
class Vistor {
  //now all methods are pure virtual
};
class ASTToOriginal {
  void Visit( Conditional* p ) {
    str << "if(";
    //now what??? after returning here, BinaryOp will visit the op automatically so I can't insert the "?"
    //If I make a PostVisit( BinaryOp* ), and call it it BinaryOp::Accept, I get the chance to insert the "?",
    //but now I have to keep a state: my PostVisit method needs to know it's currently being called as part of a Conditional
    //Things are even worse for the ifTrue/ifFalse statement arrays: each element needs a ";" appended, but not the last one,
    //how am I ever going to do that in a clean way?
  }
};

Question: is this approach just not suited for my case, or am I overlooking something essential? Is there a common design to cope with these problems? What if I also need traversal in a different direction?

like image 747
stijn Avatar asked Feb 25 '11 17:02

stijn


2 Answers

There are two problems:

  • What children nodes are possible to visit?
  • What order should you visit them?

Arguably the actual children nodes should be known by node type; actually, it should be known by the grammar, and "reflected" from the grammar into the general visitor.

The order in which the nodes are visited depend completely on what you need to do. If you are doing prettyprinting, a left-to-right child order makes sense (if the children nodes are listed in the order of the grammar, which they might not be). If you are constructing symbol tables, you surely want to visit the declaration children before you visit the statement body child.

In addition, you need to worry about what information flows up or down the tree. A list-of-variable accesses would flow up the tree. A constructed symbol table flows up the tree from the declarations, and back down into statement body child. And this information flow forces the visit order; to have a symbol table to pass down into the statement body, you first have to have a symbol table constructed and passed up from the declarations child.

I think these issues are the ones that are giving you greif. You are trying to impose a single structure on your visitors, when in fact the visit order is completely task dependent, and there are lots of different tasks you might do with a tree, each with its own information flow and thus order dependency.

One of the ways this is solved is with the notion of an attribute(d) grammar (AG), in which one decorates the grammar rules with various types of attributes and how they are computed/used. You literally write a computation as an annotation to the grammar rule, for instance:

  method = declarations statements ;
  <<ResolveSymbols>>: { declarations.parentsymbols=method.symboltable;
                        statements.symboltable = declarations.symboltable;
                      }

The grammar rule tells you what node types you have to have. The atrribute computation tells you what value are being passed down the tree (reference to method.symboltable is to something coming form the parent), up the tree (reference to declarations.symbol table is to something computed by that child), or across the tree (statements.symboltable is passed down to the statements child, from the value computed by declarations.symboltable). The attribute computation defines the visitor. The executed computation is a called "attribute evaluation".

This notation for this particular attribute grammar is part of our DMS Software Reengineering Toolkit. Other AG tools use similar notations. Like all (AG) schemes, the particular rule is used to manufacture the purpose-specific ("ResolveSymbols") visitor for the specific node ("method"). With a set of such specifications for each node, you get a set of purpose-specific visitors that can be executed. The value of an AG scheme is that you can write this easily and all the boilerplate gunk gets generated.

You can think about your problem in the abstract like this, and then simply generate the purpose-specific visitors by hand, as you have been doing.

like image 77
Ira Baxter Avatar answered Nov 09 '22 08:11

Ira Baxter


For recursive generic traversing of trees, Visitor and Composite are usually used together, like in (first relevant google link) there. I first read about this idea there. There are also visitor combinators which are a nice idea.

And by the way...

this is where functional languages shine, with their Algebraic Data Types and pattern matching. If you can, switch to a functional language. Composite and Visitor are only ugly workarounds for lack of language support for respectively ADT and pattern matching.

like image 31
Alexandre C. Avatar answered Nov 09 '22 10:11

Alexandre C.