Constructing an Abstract Syntax Tree with a list of Tokens

Tags:

I want to construct an AST from a list of tokens. I'm making a scripting language and I've already done the lexical analysis part, but I have no idea how to create an AST. So the question is, how do I take something like this:

WORD, int WORD, x SYMBOL, = NUMBER, 5 SYMBOL, ;

and convert it into an Abstract Syntax Tree? Preferably, I'd like to do so without a library like ANTLR or whatever, I'd rather try and do it from scratch myself. However, if it's a really complex task, I don't mind using a library :) Thanks

556

asked Jul 31 '14 02:07

metro-man

1 Answers

The fundamental trick is to recognize that parsing, however accomplished, happens in incremental steps, including the reading of the tokens one by one.

At each incremental step, there is an opportunity to build part of the AST by combining AST fragments built by other incremental steps. This is a recursive idea, and it bottoms out in building AST leaf nodes for tokens as they are scanned. This basic idea occurs in pretty much all AST-building parsers.

If one builds a recursive descent parser, one in effect builds a cooperating system of recursive procedures, each one of which recognizes a nonterminal in whatever grammar is being implemented. For pure parsing, each procedure simply returns a boolean for "nonterminal (not) recognized".

To build an AST with a recursive descent parser, one designs these procedures to return two values: the boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is return a pointer, which is void for "not recognized", or points to the constructed AST if "recognized"). The way the resulting AST for a single procedure is built, is by combining the ASTs from the sub-procedures that it invokes. This is pretty trivial to do for leaf procedures, which read an input token and can immediately build a tree.

The downside to all this is one must manually code the recursive descent, and augment it with the tree building steps. In the grand scheme of things, this is actually pretty easy to code for small grammars.

For OP's example, assume we have this grammar:

GOAL = ASSIGNMENT  ASSIGNMENT = LHS '=' RHS ';'  LHS = IDENTIFIER  RHS = IDENTIFIER | NUMBER

OK, our recursive descent parser:

boolean parse_Goal() {  if parse_Assignement()    then return true    else return false }  boolean parse_Assignment() {  if not Parse_LHS()    then return false    if not Parse_equalsign()    then throw SyntaxError // because there are no viable alternatives from here    if not Parse_RHS()    then throw SyntaxError    if not Parse_semicolon()    the throw SyntaxError    return true }  boolean parse_LHS() {  if parse_IDENTIFIER()    then return true    else return false }  boolean parse_RHS() {  if parse_IDENTIFIER()    then return true    if parse_NUMBER()    then return true    else return false }  boolean parse_equalsign() {  if TestInputAndAdvance("=")  // this can check for token instead    then return true    else return false }  boolean parse_semicolon() {  if TestInputAndAdvance(";")    then return true    else return false }  boolean parse_IDENTIFIER() {  if TestInputForIdentifier()    then return true    else return false }  boolean parse_NUMBER() {  if TestInputForNumber()    then return true    else return false }

Now, let's revise it build a abstract syntax tree:

AST* parse_Goal() // note: we choose to return a null pointer for "false" {  node = parse_Assignment()    if node != NULL    then return node    else return NULL }  AST* parse_Assignment() {  LHSnode = Parse_LHS()    if LHSnode == NULL    then return NULL    EqualNode = Parse_equalsign()    if EqualNode == NULL    then throw SyntaxError // because there are no viable alternatives from here    RHSnode = Parse_RHS()    if RHSnode == NULL    then throw SyntaxError    SemicolonNode = Parse_semicolon()    if SemicolonNode == NULL    the throw SyntaxError    return makeASTNode(ASSIGNMENT,LHSNode,RHSNode) }  AST* parse_LHS() {  IdentifierNode = parse_IDENTIFIER()    if node != NULL    then return IdentifierNode    else return NULL }  AST* parse_RHS() {  RHSnode = parse_IDENTIFIER()    if RHSnode != null    then return RHSnode    RHSnode = parse_NUMBER()    if RHSnode != null    then return RHSnode    else return NULL }  AST* parse_equalsign() {  if TestInputAndAdvance("=")  // this can check for token instead    then return makeASTNode("=")    else return NULL }  AST* parse_semicolon() {  if TestInputAndAdvance(";")    then return makeASTNode(";")    else return NULL }  AST* parse_IDENTIFIER() {  text = TestInputForIdentifier()    if text != NULL    then return makeASTNode("IDENTIFIER",text)    else return NULL }  AST* parse_NUMBER() {  text = TestInputForNumber()    if text != NULL    then return makeASTNode("NUMBER",text)    else return NULL }

I've obviously glossed over some details, but I assume the reader will have no trouble filling them in.

Parser generator tools like JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for constructing trees that work very much like this.

Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, a stack of tokens seen and reduced-to nonterminals is managed by these tools. The AST nodes are constructed on a parallel stack; when a reduction occurs, the AST nodes on the part of the stack covered by the reduction are combined to produce a nonterminal AST node to replace them. This happens with "zero-size" stack segments for grammar rules which are empty too causing AST nodes (typically for 'empty list' or 'missing option') to seemingly appear from nowhere.

With bitty languages, writing recursive-descent parsers that build trees is pretty practical.

A problem with real languages (whether old and hoary like COBOL or hot and shiny like Scala) is that the number of grammar rules is pretty large, complicated by the sophistication of the language and the insistence on whatever language committee is in charge of it to perpetually add new goodies offered by other languages ("language envy", see the evolutionary race between Java, C# and C++). Now writing a recursive descent parser gets way out of hand and one tends to use parser generators. But even with a parser generator, writing all the custom code to build AST nodes is also a big battle (and we haven't discussed what it takes to design a good "abstract" syntax vs. the first thing that comes to mind). Maintaining grammar rules and AST building goo gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you'll want to change it). So even writing the AST building rules gets awkward.

Ideally, one would just like to write a grammar, and get a parser and tree. You can do this with some recent parser generators: Our DMS Software Reengineering Toolkit accepts full context free grammars, and automatically constructs an AST, no work on the grammar engineer's part; its been doing this since 1995. The ANTLR guys finally figured this out in 2014, and ANTLR4 now offers an option like this.

Last point: having a parser (even with an AST) is hardly a solution to the actual problem you set out to solve, whatever it was. Its just a foundation piece, and much to the shock for most parser-newbies, it is the smallest part to a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more detail.

answered Sep 20 '22 05:09

Ira Baxter

Related questions
                            
                                Where can I find a Java to C# converter? [closed]
                            
                                What is method inlining? [duplicate]
                            
                                Observable which does not pass anything in onNext()
                            
                                Handling the null value from a resultset
                            
                                How can I do unit test for hashCode()?
                            
                                How to use <spring:message> inside an attribute of <form:input>?
                            
                                sort list by date in descending order - groovy madness
                            
                                How to add elements in List when used Arrays.asList()
                            
                                With android studio no jvm found, JAVA_HOME has been set
                            
                                How to pass a JSON array as a parameter in URL
                            
                                How to set formulas in cells using Apache POI?
                            
                                JodaTime equivalent of DateUtils.truncate()
                            
                                JUnit test — analysing expected Exceptions [duplicate]
                            
                                Java and Clojure with Leiningen
                            
                                Android Color Picker [closed]
                            
                                Using java map for range searches
                            
                                Google Protocol Buffers and HTTP
                            
                                How to open warning/information/error dialog in Swing?
                            
                                How can I make a field non-serializable in java?
                            
                                How to check if an excel cell is empty using Apache POI?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Constructing an Abstract Syntax Tree with a list of Tokens

Tags:

java

interpreter

abstract-syntax-tree

metro-man

People also ask

1 Answers

Ira Baxter

Recent Activity

Donate For Us