Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Where/how to declare the unique key of variables in a compiler written in Ocaml?

I am writing a compiler of mini-pascal in Ocaml. I would like my compiler to accept the following code for instance:

program test;
var
   a,b : boolean;
   n : integer;
begin
   ...
end.

I have difficulties in dealing with the declaration of variables (the part following var). At the moment, the type of variables is defined like this in sib_syntax.ml:

type s_var =
    { s_var_name: string;
      s_var_type: s_type; 
      s_var_uniqueId: s_uniqueId (* key *) }

Where s_var_uniqueId (instead of s_var_name) is the unique key of the variables. My first question is, where and how I could implement the mechanism of generating a new id (actually by increasing the biggest id by 1) every time I have got a new variable. I am wondering if I should implement it in sib_parser.mly, which probably involves a static variable cur_id and the modification of the part of binding, again don't know how to realize them in .mly. Or should I implement the mechanism at the next stage - the interpreter.ml? but in this case, the question is how to make the .mly consistent with the type s_var, what s_var_uniqueId should I provide in the part of binding?

Another question is about this part of statement in .mly:

id = IDENT COLONEQ e = expression
  { Sc_assign (Sle_var {s_var_name = id; s_var_type = St_void}, e) }

Here, I also need to provide the next level (the interpreter.ml) a variable of which I only know the s_var_name, so what could I do regarding its s_var_type and s_var_uniqueId here?

Could anyone help? Thank you very much!

like image 368
SoftTimur Avatar asked Jun 29 '11 09:06

SoftTimur


People also ask

What does :: do in OCaml?

Regarding the :: symbol - as already mentioned, it is used to create lists from a single element and a list ( 1::[2;3] creates a list [1;2;3] ).

What does -> mean in OCaml?

The way -> is defined, a function always takes one argument and returns only one element. A function with multiple parameters can be translated into a sequence of unary functions.

What does in mean OCaml?

in" is used to name values used throughout * the computation (i.e., to make local variables). *) let alen = List.length a in let blen = List.length b in if alen < blen then -1 else if alen > blen then 1 else 0 let rec split list = (* Here "let ... in" is used to name * the parts of an intermediate result.


2 Answers

The first question to ask yourself is whether you actually need an unique id. From my experience, they're almost never necessary or even useful. If what you're trying to do is making variables unique through alpha-equivalence, then this should happen after parsing is complete, and will probably involve some form of DeBruijn indices instead of unique identifiers.

Either way, a function which returns a new integer identifier every time it is called is:

let unique = 
  let last = ref 0 in 
  fun () -> incr last ; !last

let one = unique ()  (* 1 *)
let two = unique ()  (* 2 *)

So, you can simply assign { ... ; s_var_uniqueId = unique () } in your Menhir rules.

The more important problem you're trying to solve here is that of variable binding. Variable x is defined in one location and used in another, and you need to determine that it happens to be the same variable in both places. There are many ways of doing this, one of them being to delay the binding until the interpreter. I'm going to show you how to deal with this during parsing.

First, I'm going to define a context: it's a set of variables that allows you to easily retrieve a variable based on its name. You might want to create it with hash tables or maps, but to keep things simple I will be using List.assoc here.

type s_context = {
  s_ctx_parent : s_context option ;
  s_ctx_bindings : (string * (int * s_type)) list ;
  s_ctx_size : int ;
}

let empty_context parent = {
  s_ctx_parent = parent ;
  s_ctx_bindings = [] ;
  s_ctx_size = 0
}

let bind v_name v_type ctx = 
  try let _ = List.assoc ctx.s_ctx_bindings v_name in
      failwith "Variable is already defined"
  with Not_found -> 
    { ctx with 
      s_ctx_bindings = (v_name, (ctx.s_ctx_size, v_type)) 
        :: ctx.s_ctx_bindings ;
      s_ctx_size = ctx.s_ctx_size + 1 }

let rec find v_name ctx =       
  try 0, List.assoc ctx.s_ctx_bindings v_name
  with Not_found -> 
    match ctx.s_ctx_parent with 
      | Some parent -> let depth, found = find v_name parent in
                       depth + 1, found
      | None -> failwith "Variable is not defined"

So, bind adds a new variable to the current context, find looks for a variable in the current context and its parents, and returns both the bound data and the depth at which it was found. So, you could have all global variables in one context, then all parameters of a function in another context that has the global context as its parent, then all local variables in a function (when you'll have them) in a third context that has the function's main context as the parent, and so on.

So, for instance, find 'x' ctx will return something like 0, (3, St_int) where 0 is the DeBruijn index of the variable, 3 is the position of the variable in the context identified by the DeBruijn index, and St_int is the type.

type s_var = {
  s_var_deBruijn: int;
  s_var_type: s_type;
  s_var_pos: int 
}

let find v_name ctx = 
   let deBruijn, (pos, typ) = find v_name ctx in 
   { s_var_deBruijn = deBruijn ;
     s_var_type = typ ;
     s_var_pos = pos }

Of course, you need your functions to store their context, and make sure that the first argument is the variable at position 0 within the context:

type s_fun =
{ s_fun_name: string;
  s_fun_type: s_type;
  s_fun_params: context; 
  s_fun_body: s_block; }

let context_of_paramlist parent paramlist = 
  List.fold_left 
    (fun ctx (v_name,v_type) -> bind v_name v_type ctx) 
    (empty_context parent)
    paramlist

Then, you can change your parser to take into account the context. The trick is that instead of returning an object representing part of your AST, most of your rules will return a function that takes a context as an argument and returns an AST node.

For instance:

int_expression:
  (* Constant : ignore the context *)
| c = INT { fun _ -> Se_const (Sc_int c) }
  (* Variable : look for the variable inside the contex *)
| id = IDENT { fun ctx -> Se_var (find id ctx) }
  (* Subexpressions : pass the context to both *)
| e1 = int_expression o = operator e2 = int_expression 
  { fun ctx -> Se_binary (o, e1 ctx, e2 ctx) }
;

So, you simply propagate the context "down" recursively through the expressions. The only clever parts are those when new contexts are created (you don't have this syntax yet, so I'm just adding a placeholder):

| function_definition_expression (args, body) 
  { fun ctx -> let ctx = context_of_paramlist (Some ctx) args in
               { s_fun_params = ctx ; 
                 s_fun_body = body ctx } }

As well as the global context (the program rule itself does not return a function, but the block rule does, and so a context is created from the globals and provided).

prog:
  PROGRAM IDENT SEMICOLON
  globals = variables
  main = block
  DOT
    { let ctx = context_of_paramlist None globals in 
      { globals = ctx;
        main = main ctx } }

All of this makes the implementation of your interpreter much easier due to the DeBruijn indices: you can have a "stack" which holds your values (of type value) defined as:

type stack = value array list 

Then, reading and writing variable x is as simple as:

let read stack x = 
  (List.nth stack x.s_var_deBruijn).(x.s_var_pos)

let write stack x value = 
  (List.nth stack x.s_var_deBruijn).(x.s_var_pos) <- value

Also, since we made sure that function parameters are in the same order as their position in the function context, if you want to call function f and its arguments are stored in the array args, then constructing the stack is as simple as:

let inner_stack = args :: stack in
(* Evaluate f.s_fun_body with inner_stack here *)

But I'm sure you'll have a lot more questions to ask when you start working on your interpeter ;)

like image 138
Victor Nicollet Avatar answered Oct 10 '22 17:10

Victor Nicollet


How to create a global id generator:

let unique =
  let counter = ref (-1) in
  fun () -> incr counter; !counter

Test:

# unique ();;
- : int = 0
# unique ();;
- : int = 1

Regarding your more general design question: it seems that your data representation does not faithfully represent the compiler phases. If you must return a type-aware data-type (with this field s_var_type) after the parsing phase, something is wrong. You have two choices:

  • devise a more precise data representation for the post-parsing AST, that would be different from the post-typing AST, and not have those s_var_type fields. Typing would then be a conversion from the untyped to the typed AST. This is a clean solution that I would recommend.

  • admit that you must break the data representation semantics because you don't have enough information at this stage, and try to be at peace with the idea of returning garbage such as St_void after the parsing phase, to reconstruct the correct information later. This is less typed (as you have an implicit assumption on your data which is not apparent in the type), more pragmatic, ugly but sometimes necessary. I don't think it's the right decision in this case, but you will encounter situation where it's better to be a bit less typed.

I think the specific choice of unique id handling design depends on your position on this more general question, and your concrete decisions about types. If you choose a finer-typed representation of post-parsing AST, it's your choice to decide whether to include unique ids or not (I would, because generating a unique ID is dead simple and doesn't need a separate pass, and I would rather slightly complexify the grammar productions than the typing phase). If you choose to hack the type field with a dummy value, it's also reasonable to do that for variable ids if you wish to, putting 0 as a dummy value and defining it later; but still I personally would do that in the parsing phase.

like image 2
gasche Avatar answered Oct 10 '22 17:10

gasche