If you were constructing a compiler, what optimization at the AST level would be the nicest to have?
An Abstract Syntax Tree, or AST, is a tree representation of the source code of a computer program that conveys the structure of the source code. Each node in the tree represents a construct occurring in the source code.
Once we have an Abstract Syntax Tree we can both manipulate it as well as "print" it into a different type of code. Using ASTs to manipulate code is safer than doing those operations directly on the code as text or on a list of tokens. Manipulating text is always dangerous; it shows the least amount of context.
The Abstract Syntax Tree is generated using both the list of tokens (from the lexical analysis) and the source code. The AST is generated during the syntax analysis stage of the compilation. Any syntax error would be detected and a syntax error message would then be returned, stopping the compilation process.
A parse tree is a record of the rules (and tokens) used to match some input text whereas a syntax tree records the structure of the input and is insensitive to the grammar that produced it. Combining the above two definitions, An Abstract Syntax Tree describes the parse tree logically.
Mostly you can't do interesting optimizations at the AST level, because you need information how how data flows from one part of the program to another. While data flow is implicit in the meaning of the AST, it isn't easily determined by inspecting just the AST, which is why people building compilers and optimizers build other program representations (including symbol tables, control flow graphs, reaching definitions, data flow and SSA forms, etc.).
Having a parser for a language is the easy part of analyzing/manipulating that language. You need all that other stuff to do a good job.
If you do have all those other representations, you can think about doing optimizations at the AST level. Most folks building compilers don't bother; they convert to a data flow representation and simply optimize that. But if you want to reproduce source code with changes, you need the AST. You'll also need a prettyprinter to enable you to regenerate the source code. If you go this far, you'll end up with a source-to-source program transformation system.
The DMS Software Reengineering Toolkit is a system that transforms ASTs, using all these other representations to enable the analyses needed by the transforms.
An optimisation that is easiest to do on the AST (rather than, say, the CFG) is tail-call optimisation: if you see a subtree of the form:
RETURN
CALL f
ARGS x, y, ...
You can replace it with a jump to f
. If f(a, b)
is the function that the tail-call appears in, the replacement is as simple as:
a = x; b = y
JUMP to root of tree
I find it easiest to represent that jump as a special "restart" statement, which the AST->CFG translation treats as an edge back to the first node. Jumping to other functions is a bit trickier since you can't just set local variables, you need to actually think ahead how arguments are passed to them and prepare to translate this at a lower level. For example, the AST might need a special node that can instruct a later pass to overwrite the current stack frame with the arguments and jump accordingly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With