Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Constructing an Abstract Syntax Tree with a list of Tokens

I want to construct an AST from a list of tokens. I'm making a scripting language and I've already done the lexical analysis part, but I have no idea how to create an AST. So the question is, how do I take something like this:

WORD, int WORD, x SYMBOL, = NUMBER, 5 SYMBOL, ; 

and convert it into an Abstract Syntax Tree? Preferably, I'd like to do so without a library like ANTLR or whatever, I'd rather try and do it from scratch myself. However, if it's a really complex task, I don't mind using a library :) Thanks

like image 556
metro-man Avatar asked Jul 31 '14 02:07

metro-man


People also ask

How do you create a syntax tree?

Rules for constructing a syntax treeEach node in a syntax tree can be executed as data with multiple fields. In the node for an operator, one field recognizes the operator and the remaining field includes a pointer to the nodes for the operands. The operator is known as the label of the node.

How would you define an abstract syntax tree?

In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring in the text.


1 Answers

The fundamental trick is to recognize that parsing, however accomplished, happens in incremental steps, including the reading of the tokens one by one.

At each incremental step, there is an opportunity to build part of the AST by combining AST fragments built by other incremental steps. This is a recursive idea, and it bottoms out in building AST leaf nodes for tokens as they are scanned. This basic idea occurs in pretty much all AST-building parsers.

If one builds a recursive descent parser, one in effect builds a cooperating system of recursive procedures, each one of which recognizes a nonterminal in whatever grammar is being implemented. For pure parsing, each procedure simply returns a boolean for "nonterminal (not) recognized".

To build an AST with a recursive descent parser, one designs these procedures to return two values: the boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is return a pointer, which is void for "not recognized", or points to the constructed AST if "recognized"). The way the resulting AST for a single procedure is built, is by combining the ASTs from the sub-procedures that it invokes. This is pretty trivial to do for leaf procedures, which read an input token and can immediately build a tree.

The downside to all this is one must manually code the recursive descent, and augment it with the tree building steps. In the grand scheme of things, this is actually pretty easy to code for small grammars.

For OP's example, assume we have this grammar:

GOAL = ASSIGNMENT  ASSIGNMENT = LHS '=' RHS ';'  LHS = IDENTIFIER  RHS = IDENTIFIER | NUMBER 

OK, our recursive descent parser:

boolean parse_Goal() {  if parse_Assignement()    then return true    else return false }  boolean parse_Assignment() {  if not Parse_LHS()    then return false    if not Parse_equalsign()    then throw SyntaxError // because there are no viable alternatives from here    if not Parse_RHS()    then throw SyntaxError    if not Parse_semicolon()    the throw SyntaxError    return true }  boolean parse_LHS() {  if parse_IDENTIFIER()    then return true    else return false }  boolean parse_RHS() {  if parse_IDENTIFIER()    then return true    if parse_NUMBER()    then return true    else return false }  boolean parse_equalsign() {  if TestInputAndAdvance("=")  // this can check for token instead    then return true    else return false }  boolean parse_semicolon() {  if TestInputAndAdvance(";")    then return true    else return false }  boolean parse_IDENTIFIER() {  if TestInputForIdentifier()    then return true    else return false }  boolean parse_NUMBER() {  if TestInputForNumber()    then return true    else return false } 

Now, let's revise it build a abstract syntax tree:

AST* parse_Goal() // note: we choose to return a null pointer for "false" {  node = parse_Assignment()    if node != NULL    then return node    else return NULL }  AST* parse_Assignment() {  LHSnode = Parse_LHS()    if LHSnode == NULL    then return NULL    EqualNode = Parse_equalsign()    if EqualNode == NULL    then throw SyntaxError // because there are no viable alternatives from here    RHSnode = Parse_RHS()    if RHSnode == NULL    then throw SyntaxError    SemicolonNode = Parse_semicolon()    if SemicolonNode == NULL    the throw SyntaxError    return makeASTNode(ASSIGNMENT,LHSNode,RHSNode) }  AST* parse_LHS() {  IdentifierNode = parse_IDENTIFIER()    if node != NULL    then return IdentifierNode    else return NULL }  AST* parse_RHS() {  RHSnode = parse_IDENTIFIER()    if RHSnode != null    then return RHSnode    RHSnode = parse_NUMBER()    if RHSnode != null    then return RHSnode    else return NULL }  AST* parse_equalsign() {  if TestInputAndAdvance("=")  // this can check for token instead    then return makeASTNode("=")    else return NULL }  AST* parse_semicolon() {  if TestInputAndAdvance(";")    then return makeASTNode(";")    else return NULL }  AST* parse_IDENTIFIER() {  text = TestInputForIdentifier()    if text != NULL    then return makeASTNode("IDENTIFIER",text)    else return NULL }  AST* parse_NUMBER() {  text = TestInputForNumber()    if text != NULL    then return makeASTNode("NUMBER",text)    else return NULL } 

I've obviously glossed over some details, but I assume the reader will have no trouble filling them in.

Parser generator tools like JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for constructing trees that work very much like this.

Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, a stack of tokens seen and reduced-to nonterminals is managed by these tools. The AST nodes are constructed on a parallel stack; when a reduction occurs, the AST nodes on the part of the stack covered by the reduction are combined to produce a nonterminal AST node to replace them. This happens with "zero-size" stack segments for grammar rules which are empty too causing AST nodes (typically for 'empty list' or 'missing option') to seemingly appear from nowhere.

With bitty languages, writing recursive-descent parsers that build trees is pretty practical.

A problem with real languages (whether old and hoary like COBOL or hot and shiny like Scala) is that the number of grammar rules is pretty large, complicated by the sophistication of the language and the insistence on whatever language committee is in charge of it to perpetually add new goodies offered by other languages ("language envy", see the evolutionary race between Java, C# and C++). Now writing a recursive descent parser gets way out of hand and one tends to use parser generators. But even with a parser generator, writing all the custom code to build AST nodes is also a big battle (and we haven't discussed what it takes to design a good "abstract" syntax vs. the first thing that comes to mind). Maintaining grammar rules and AST building goo gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you'll want to change it). So even writing the AST building rules gets awkward.

Ideally, one would just like to write a grammar, and get a parser and tree. You can do this with some recent parser generators: Our DMS Software Reengineering Toolkit accepts full context free grammars, and automatically constructs an AST, no work on the grammar engineer's part; its been doing this since 1995. The ANTLR guys finally figured this out in 2014, and ANTLR4 now offers an option like this.

Last point: having a parser (even with an AST) is hardly a solution to the actual problem you set out to solve, whatever it was. Its just a foundation piece, and much to the shock for most parser-newbies, it is the smallest part to a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more detail.

like image 60
Ira Baxter Avatar answered Sep 20 '22 05:09

Ira Baxter