Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Standard format for concrete and abstract syntax trees

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.

Is there a standard format for describing either concrete or abstract syntax trees? Do any widely-used tool chains support outputting to these formats?

I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, Javascript, or C/C++.

I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.

Thanks!

like image 808
Brandon Bloom Avatar asked Feb 17 '09 09:02

Brandon Bloom


People also ask

What is abstract and concrete syntax tree?

CST(Concrete Syntax Tree) is a tree representation of the Grammar(Rules of how the program should be written). Depending on compiler architecture, it can be used by the Parser to produce an AST. AST(Abstract Syntax Tree) is a tree representation of Parsed source, produced by the Parser part of the compiler.

How do you syntax an abstract tree?

Abstract Syntax Tree is a kind of tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.

Is abstract syntax tree and syntax tree same?

In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring in the text.

What is concrete syntax tree in compiler design?

Originally, Concrete Syntax Tree (CST) is used for representation of a source code. CST represents concrete source code elements attached to corresponding construction in language syntax.


1 Answers

The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another. See http://www.gupro.de/GXL

More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees. See http://www.omg.org/spec/ASTM/1.0/Beta1/

This problem seems to get solved over and over again. There's half a dozen "tool bus" proposals made over the years that all solved it, with no one ever overtaking the industry. The problem is that a) it is easy to represent ASTs using any kind of nestable notation [parentheses like LISP, like XML, ...] so people roll their own solution easily, and b) for one tool to exchange an AST with another, they both have to agree essentially on what the AST nodes mean; but most ASTs are rather accidentally derived from the particular grammar/parsing technology used by each tool, and there's almost always disagreement about that between tools. So, I've seen very few tools that exchange ASTs meaningfully.

If you're doing a hobby thing, I'd stick with a lisp-like encoding of trees, where each node has the following format: ( ... ) Its easy to generate, and easy to read.

I work on a professional tool to manipulate programs. If we have print out the AST, we do the above. Mostly individual ASTs are far too complicated to look at in practice, so we hardly ever print out the entire AST, at best only a node and a few children deep. Our tool doesn't exchange ASTs with anybody (see above reasons :) but does just fine building it in memory, doing whizzy things with it for analysis reasons or transformation reasons, and then either just deleteing it (no need to send it anywhere) or regenerating the original language text from the tree. [The latter means you need anti-parsing or "prettyprinting" technology]

like image 80
Ira Baxter Avatar answered Oct 08 '22 18:10

Ira Baxter