Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When to use an abstract or concrete syntax tree?

I've been doing research on compilers. The lexer seems to be very straight forward: Take a "sentence" and break it up into words (or tokens). To ensure correct grammar a parser is needed. The parser generally takes the tokens and builds a tree that results in a root node (words into sentences, paragraphs, pages, etc...).

From this question it would seem a parser would build an AST. The AST only contains what is necessary to execute the code, so things like parentheses would be unnecessary since operator precedence is built into an AST. An AST is probably all a compiler needs.

But what about converting code from one language to another? Taking a made-up language (grammar) or an existing grammar and converting it into another where operator precedence rules may or may not be different? Is operator precedence "built in" to CST as well?

As an example lets say I made up a language and wanted to translate it into PHP code. The ternary operator on most languages has right-to-left associativity. PHP incorrectly uses left-to-right associativity (see more about this here). I want "my language" to use right-to-left but the resulting PHP code has to apply parenthesis to get the correct result in PHP (with the link to Wikipedia, result needs to be "train" instead of "horse").

So for language translation would a CST be better? Is operator precedence usually built into a CST? Is there anything in-between? Are there any examples comparing both trees with a simple algebra equation? Any examples illustrating a ternary operator?

(Is "transcode" the correct term for "programming language translation"? A Google search brings up converting media.)

What I'm trying to figure out is: When is it more appropriate to use one over the other?

like image 817
Luke Avatar asked Feb 26 '12 19:02

Luke


1 Answers

An AST that models all the semantic details of the source language is all you need. By definition, if it does model the semantics correctly, and your langauge includes a ternary operator, then it will model the specific order in which the operators are applied (e.g, the results of the predecence modulo overrides such as parentheses) correctly too.

So your problem isn't in the AST. It is generating to another language using similar (ternary) operators whose precedence is different.

This is an age-old problem in code generation: the operators of the target don't quite match the operators of the source, and so the output can't be one-to-one. In your case, you should be able to solve the problem by generating PHP ternary operators with parentheses around them to control the order to achieve what the original semantics are, so this isn't a big problem.

In general, generating sequences of code that achieve a desired result can be pretty complicated, and there's lots of ways to do it. That's why the compiler books are thick rather than thin. You seem to have implicitly settled on "get AST, walk AST, spit code"; this is almost an on-the-fly code generator. And this works adequately if you don't care if the generated code is particularly good, and the target language is pretty close to the source language.

If the code generation problem is more complex, what normally happens is the AST is used to generate what amounts to a data flow model of the computation, consisting of operators that produce results, and consume results from previous operators, grounding out in "operators" that fetch variable values and constants. Then the data flow representation is traversed to generate the code; this has the advantage that you can choose an operator in the data flow representation, find a matching code sequence in the target language, generate that, and then worry about how the operands are collected. Better schemes match data flow subgraphs (representing equivalent compound target language constructs) to the produced data flow graph; this can produce significantly better code. Often, one can apply target-language specific optimizations after raw code generation to produce even better code. In both cases, you have to worry about managing the operator results; can they be fed directly to the next target language operator, or must they go into some kind of temporary storage (for machine code, this can be another register or a memory location). Doing all this isn't easy; again, that's why the compiler books aren't thin.

A variation on this idea is source-to-source program transformations. This maps constructs in source code "directly" to constructs in target code, although this is usually done behind the scenes by operating on ASTs because unparsed programming language text is hard to match. Our DMS Software Reengineering Toolkit is an example of this kind of system. With such a tool, you write patterns in the source language (which implicitly match against a parse tree), and corresponding patterns in the target langauge (implicitly producing target language ASTs). You can write complex source or target constructs giving much of the effect of the data flow graph matching above. Post-generation optimization consists of more rewrite rules that transform the target code to target code.

Bottom line: Having an AST isn't really enough unless your translation is truly trivial. You can read more about what you do need at this SO answer: https://stackoverflow.com/a/3460977/120163

WARNING: Loud opinion follows.

Re "Transcoder": I much prefer the term "compilation", "translation", or "source-to-source" compiler. I've been building program analysis and manipulation tools for almost 40 years. I'd never heard the term "transcoder" until I encountered this SO question: Experience migrating legacy Cobol/PL1 to Java and a response describing IMHO a truly awful code translation scheme called NACA. I've heard since that this term is gaining traction; I don't see why we had to invent another term when we have perfectly adequate ones. Usually this is a sign of somebody inventing a high priesthood; "let's invent a shiny new term so people don't really understand what we're doing". I'm happy to leave that term to such truly awful translations.

like image 63
Ira Baxter Avatar answered Sep 27 '22 21:09

Ira Baxter