Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate an AST in C++

I'm making an interpreter in C++, so far I've got my lexer to generate tokens. The problem is I'm not sure how to generate an "walk" a parse tree.

I was thinking of making my parse tree using an array of arrays, but I'm not sure how to actually insert the tokens into the parse tree in the correct order.

I'm not sure whether or not to go top-down, left-right or bottom-up, right-left.

Can anyone provide me with a simple LALR(1) algorithm?

like image 521
Francis Avatar asked Feb 15 '15 19:02

Francis


People also ask

What is AST in C?

An Abstract Syntax Tree (AST) abstracts out certain details, and retains just enough information to help the compiler understand the structure of the code. Therefore, an AST is a tree data structure that best represents the syntactic structure of the source code.

What Is syntax tree generation example?

A syntax tree is a tree in which each leaf node represents an operand, while each inside node represents an operator. The Parse Tree is abbreviated as the syntax tree. The syntax tree is usually used when representing a program in a tree structure.

What is AST in R?

Section 18.2 introduces the idea of the abstract syntax tree (AST), and reveals the tree like structure that underlies all R code. Section 18.3 dives into the details of the data structures that underpin the AST: constants, symbols, and calls, which are collectively known as expressions.

How do you define AST in Python?

For is a class defined in the ast module that expresses a for loop in Python in the form of an Abstract Syntax Tree. When the parse() method of ast is called on a Python source code that contains for loops, the ast. For class is invoked, which expresses the for statement to a node in an ast tree data structure.


1 Answers

I'm going to go against conventional wisdom here and say that you should build your AST manually with natural language-specific data-structures.

A generic "AST data structure" will be too general to work with comfortably -- the code that consumes the AST to do anything useful with it will be obscured by the workarounds to access the data it wants. If you go this route (or use a parser generator), you'll likely end up building a translation layer to go from the generic structure to an AST that actually makes sense for your language. Why not avoid the middleman and build the final data structure directly?

I suggest writing a first syntactic pass that consumes the tokens it needs for each possible construct (probably peeking ahead one token as needed). This syntactic pass would construct the AST out of linked instances of data structures that represent each possible construct in your language. For example, if your program can consist of a series of statements, where each statement can be either a function declaration or a function call, you could create the following data structures:

AST (struct)
    -> std::vector<Statement*> statements

StatementKind (enum class)
    -> Function
    -> Call

Statement (struct)
    -> StatementKind kind

Function : Statement (struct)
    -> std::string name
    -> std::vector<Statement*> body

Call : Statement (struct)
    -> std::string name
    -> Function* called

Then the syntactic pass to build the initial AST might look like:

auto ast = parseAST();

where parseAST calls parseStatement repeatedly, which consumes and/or peeks at tokens to determine whether the statement is a function definition or a function call, then calls parseFunction or parseCall appropriately. This is called a hand-written recursive descent parser, and is in my experience by far the simplest and best type of parser you can write. Yes there is boilerplate to write, but it's also very clear and flexible code.

The syntactic phase generates syntax error messages as it goes, attempting to recover as best as it can when an error is found (this is one argument for semicolon-delimited languages -- the compilers and tools can often recover from errors much more easily, since it often suffices to skip to the next semicolon when an error is encountered before trying to parse the next construct).

If a function is allowed to be called before it is defined, this is simple to handle too -- just parse the call part without checking if a function with that name exists at the time you first parse it, then correlate it later.

A second, semantic, pass through the AST would annotate it with any missing cross-referenced data (e.g. resolving function calls' function names with those functions' definitions, or generating an error if the name is not found). Once that's done, you have a full AST that's guaranteed to represent a valid program (since you did all the semantic error checking in this pass), and can have optimizations applied to it, followed by the code generation phase (and more optimizations along the way).

Note that I completely left out any mention of LL or LR parsers, etc. The theoretical notation and categorization of your language is important (as it can prove it's non-ambiguous, for example), but from an implementation point of view, it's far more important to have clean, easily understood and above all easily modifiable code than to adhere to a specific parsing mechanism. Examples of other compilers that use hand-written parsers include GCC, Clang, and Google's V8, among others.

In terms of efficiency, the AST is generally iterated over several times after it's constructed, and needs to stick around in memory until late in the process (possibly right until the end, depending on how you do your code generation), and so it's best to use an object pool to allocate the records for your AST than to dynamically allocate everything separately on the heap. Finally, in place of strings, it's generally better to use an offset-and-length into the original source code.

like image 68
Cameron Avatar answered Sep 20 '22 06:09

Cameron