Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

menhir - associate AST nodes with token locations in source file

I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typecheck and other passes in error reports generated for a user, I would like to refer to source file position where it occurred. These are not parsing errors, and they generated after parsing is completed.

A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessary clumsy. What are the established practices to do that?

like image 912
krokodil Avatar asked Jul 11 '17 02:07

krokodil


2 Answers

I don't know if it's a best practice, but I like the approach taken in the abstract syntax tree of the Frama-C system; see https://github.com/Frama-C/Frama-C-snapshot/blob/master/src/kernel_services/ast_data/cil_types.mli

This approach uses "layers" of records and algebraic types nested in each other. The records hold meta-information like source locations, as well as the algebraic "node" you can match on.

For example, here is a part of the representation of expressions:

type ...

and exp = { 
  eid: int; (** unique identifier *)
  enode: exp_node; (** the expression itself *)
  eloc: location; (** location of the expression. *)
}

and exp_node =
  | Const      of constant              (** Constant *)
  | Lval       of lval                  (** Lvalue *)
  | UnOp       of unop * exp * typ
  | BinOp      of binop * exp * exp * typ
...

So given a variable e of type exp, you can access its source location with e.eloc, and pattern match on its abstract syntax tree in e.enode.

So simple, "top-level" matches on syntax are very easy:

let rec is_const_expr e =
  match e.enode with
  | Const _ -> true
  | Lval _ -> false
  | UnOp (_op, e', _typ) -> is_const_expr e'
  | BinOp (_op, l, r, _typ) -> is_const_expr l && is_const_expr r

To match deeper in an expression, you have to go through a record at each level. This adds some syntactic clutter, but not too much, as you can pattern match on only the one record field that interests you:

let optimize_double_negation e =
  match e.enode with
  | UnOp (Neg, { enode = UnOp (Neg, e', _) }, _) -> e'
  | _ -> e

For comparison, on a pure AST without metadata, this would be something like:

let optimize_double_negation e =
  match e.enode with
  | UnOp (Neg, UnOp (Neg, e', _), _) -> e'
  | _ -> e

I find that Frama-C's approach works well in practice.

like image 166
Isabelle Newbie Avatar answered Nov 14 '22 16:11

Isabelle Newbie


You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,

type node = 
  | Typedef of typdef
  | Typeexp of typeexp
  | Literal of string
  | Constant of int
  | ...

type annotated_node = { node : node; loc : loc}

Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,

 match node with
 | {node=Typedef t} -> pp_typedef t
 | ...

Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or recursively, like in Frama-C example by @Isabelle Newbie.

A similar but more general approach is to extend a node not with the location, but just with a unique identifier and to use a final map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data as you actually externalize node attributes. The drawback is that you can't actually guarantee the totality of an attribute since finite maps are no total. Thus it is harder to preserve an invariant that, for example, all nodes have a location.

Since every heap allocated object already has an implicit unique identifier, the address, it is possible to attach data to the heap allocated objects without actually wrapping it in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to them, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (in case if it has, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit.

Of course, by saying an address, I do not literally mean the physical or virtual address of an object. OCaml uses a moving GC so an actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.

Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:

let c1 = Literal "hello"
let c2 = Constant 42

Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)

module Locations = Ephemeron.K1.Make(struct
   type t = node
   let hash = Hashtbl.hash (* or your own hash if you have one *)
   let equal = (=) (* or a specilized equal operator *)
end)

The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,

let locations = Locations.create 1337


(* somewhere in the semantics actions, where c1 and c2 are created *)

Locations.add c1 "hello.ml:12:32"
Locations.add c2 "hello.ml:13:56"

And later, you can extract the location:

# Locations.find locs c1;;
- : string = "hello.ml:12:32"

As you see, although the solution is nice in the sense, that it doesn't touch the node data type, so the rest of your code can pattern match on it nice and easy, it is still a little bit dirty, as it requires global mutable state, that is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function, that normalizes all literals:

let normalize = function
  | Literal str -> Literal (normalize_literal str)
  | node -> node 

It will create a new Literal node from the original nodes, so all the location information will be lost. That means, that you need to update the location information, every time you derive one node from another.

Another issue with ephemera is that they can't survive the marshaling or serialization. I.e., if you store your AST somewhere in a file, and then you restore it, all nodes will loose their identity, and the location table will become empty.

Speaking of the "monadic approach" that you mentioned in comments. Though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) In order to attach something to a node we still need to extend it with an extra attribute - either a location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter though, as instead of having a global reference to the last assigned identifier, we can use a state monad, to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUID and get identifiers that are not only unique in a program run, but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in the sane world). And although it looks like that generating the UUID doesn't use any state, underneath the hood it still uses an imperative random number generator, so it is sort of cheating, but still can seen as pure functional, as it doesn't contain observable effects.

like image 41
ivg Avatar answered Nov 14 '22 16:11

ivg