Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do we use intermediate languages instead of AST?

What is the difference between an intermediate language and an AST? As far as I can tell, they both offer flow analysis information which a compiler can use for optimization purposes. I know GCC uses two intermediate representations - an AST and an IL. What is the reason for this?

like image 210
RouteMapper Avatar asked Dec 03 '13 15:12

RouteMapper


3 Answers

GCC is using far more than just two intermediate representations, and far less than it should have used.

There is a compiler design methodology, known as "nanopass": a compiler is built of a sequence of very simple code rewrites, starting with an original AST produced by a parser and ending up in a low level code. Each transform is trivial, and difference between neighbouring intermediate languages is subtle.

This way it is easy to reason about each of the transforms, easy to comprehend the whole chain and easy to add new functionality. A rich language may have a lot of syntax sugar which can be expressed in terms of simpler language constructs before doing any type checking, for example.

Each of the languages in this chain is represented as an AST, of course, but usually only the very first one, which was produced by a parser, will be called an "AST", and all the others will be "intermediate languages". Of course, terminology might vary between different schools of thought. I personally prefer to use the term "AST" all the way through.

like image 137
SK-logic Avatar answered Nov 17 '22 12:11

SK-logic


Different representations permit different optimizations.

An AST is a type of intermediate representation, it's not the string you typed and it's not machine code. AST's are useful for some optimizations

  1. Constant folding
  2. Some Inlining
  3. Rewrite rules

But it's horrible for some other things, for example, imagine trying to figure out register spilling in an AST, or in the machine code itself? A compiler is usually structured as a pipeline, each step with it's own IL and it's own set of tasks to perform.

This way, each IL can be custom fitted to be easy to compile from the previous IL and easy to optimize in whatever way. GCC for example IIRC has an IL that's basically like assembly, this is great for doing register-based optimizations, like jiggering with what's loaded when. And this is also trivial to turn into real assembly or just straight machine code.

GCC is composed of many of these little ILs, they exist only as datastructures in the compiler and are created, messed with a bit, and then compiled to a lower level IL.

like image 25
Daniel Gratzer Avatar answered Nov 17 '22 12:11

Daniel Gratzer


ASTs don't have any flow information.

Yet flow information is incredibly helpful in generating efficient code. So you don't have any choice: if you want flow information, you pretty much need something in addition to ASTs.

This isn't unique to GCC; pretty much most compilers do this.

One of the key insights of the AI folks is that "representations make it easy to extract certain facts of interest", and that different representations are good for different kinds of facts.

What this means in practice is that compilers may have several or many representations of the program (control flow graphs, symbol tables, data flow graphs, "triples" (a kind of IL), machine code models, ...) depending on what information is needed by that compiler to do its job well.

like image 1
Ira Baxter Avatar answered Nov 17 '22 12:11

Ira Baxter