Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to build Abstract Syntax Trees from grammar specification in Haskell?

I'm working on a project which involves optimizing certain constructs in a very small subset of Java, formalized in BNF.

If I were to do this in Java, I would use a combination of JTB and JavaCC which builds an AST. Visitors are then used to manipulate the tree. But, given the vast libraries for parsing in Haskell (parsec, happy, alex etc), I'm a bit confused in chossing the appropriate library.

So, simply put, when a language is specified in BNF, which library offers the easiest means to build an AST? And what is the best way to go about modifying this tree in idiomatic Haskell?

like image 588
Vamshi Surabhi Avatar asked Sep 10 '13 11:09

Vamshi Surabhi


People also ask

What is an Abstract Syntax Tree how do you construct it?

An abstract syntax tree (AST) is a tree that represents the abstract syntactic structure of a language construct where each interior node and the root node represents an operator, and the children of the node represent the operands of that operator. I've already mentioned that ASTs are more compact than parse trees.

How would you construct an Abstract Syntax Tree for arithmetic expressions?

Typically, you would split the work into a tokenizer which splits the input stream representing the expression into a list of tokens, and a parser which takes the list of tokens and constructs a parse tree\ast from it. The first column is the actual text value. The second represents the token type.

What is the difference between syntax tree and Abstract Syntax Tree?

An Abstract Syntax Tree describes the parse tree logically. It does not need to contain all the syntactical constructs required to parse some source code (white spaces, braces, keywords, parenthesis etc). That's why Parse Tree is also called Concrete Syntax Tree while the AST is called Syntax Tree .

What makes Abstract Syntax Tree better than parse tree?

Abstract syntax trees are important data structures in a compiler. It contains the least unnecessary information. Abstract syntax trees are more compact than a parse tree and can be easily used by a compiler.


2 Answers

Well in Haskell there are 2 main ways of parsing something, parse combinators or a parser generator. Since you already have a BNF I'd suggest the latter.

A good one is alex. GHC's parser IIRC is written using this so you'd be in good company.

Next you'll have a big honking stack of data declarations to parse into:

data JavaClass = {
    className :: Name,
    interfaces :: [Name],
    contents :: [ClassContents],
    ...
 }
  data ClassContents = M Method
                     | F Field
                     | IC InnerClass

and for expressions and whatever else you need. Finally you'll combine these into something like

data TopLevel = JC JavaClass
              | WhateverOtherForms
              | YouWillParse

Once you have this you'll have the entire AST represented as one TopLevel or a list of them depending on how many you classes/files you parse.

To proceed from here depends on what you want to do. There are a number of libraries such as syb (scrap your boilerplate) that let you write very concise tree traversals and modifications. lens is also an option. At a minimum check out Data.Traversable and Data.Foldable.

To modify the tree, you can do something as simple as

ignoreInnerClasses :: JavaClass -> JavaClass
ignoreInnerContents c = c{contents = filter isClass $ contents c}
 --                           ^^^ that is called a record update
    where isClass (IC _) = True
          isClass _      = False

and then you could potentially use something like syb to write

 everywhere (mkT ignoreInnerClass) toplevel

which will traverse everything and apply ignoreInnerClass to all JavaClasses. This is possible to do in lens and many other libraries too, but syb is very easy to read.

like image 146
Daniel Gratzer Avatar answered Oct 04 '22 20:10

Daniel Gratzer


I've never used bnfc-meta (suggested by @phg), but I would strongly recommend you look into BNFC (on hackage: http://hackage.haskell.org/package/BNFC). The basic approach is that you write your grammar in an annotated BNF style, and it will automatically generate an AST, parser, and pretty-printer for the grammar.

How suitable BNFC is depends upon the complexity of your grammar. If it's not context-free, you'll likely have a difficult time making any progress (I did make some success hacking up context-sensitive extensions, but that code's likely bit-rotted by now). The other downside is that your AST will very directly reflect the grammar specification. But since you already have a BNF specification, adding the necessary annotations for BNFC should be rather straightforward, so it's probably the fastest way to get a usable AST. Even if you decide to go a different route, you might be able to take the generated data types as a starting point for a hand-written version.

like image 23
John L Avatar answered Oct 04 '22 20:10

John L