 

Recursive descent parser and functional programming

So lately I have been working on writing a simple compiler to better understand compiler concepts. As a diligent reader of Stack Overflow, I gather there is a consensus that writing a compiler in a functional language is easier than in an imperative one. To this end I thought I would kill two birds with one stone and write a compiler in F#, learning a functional language and writing a compiler at the same time.

I have been reading through the dragon book and decided to start with a recursive descent parser written by hand in F#. The dragon book, however, has almost all of the code samples in an imperative style. For instance, the match token function does a significant part of its work via side effect.
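Roughly, what I mean is something like this in F# (an illustrative sketch of the book's style with my own placeholder names, not code from the book):

// Imperative style as in the dragon book: a shared cursor into the token
// stream is advanced by side effect. (Illustrative sketch only; the Token
// type and token list are placeholders.)
type Token = NUM of int | PLUS | EOF

let mutable remaining = [NUM 1; PLUS; NUM 2; EOF]

let matchToken expected =
    match remaining with
    | t :: rest when t = expected -> remaining <- rest   // side effect: consume the token
    | t :: _ -> failwithf "expected %A but saw %A" expected t
    | [] -> failwithf "expected %A but input was empty" expected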

So my question is what would a more traditional functional approach to parsing (i.e. few side effects) look like? I know that the Haskell compiler (GHC) is written in Haskell but I would appreciate a somewhat smaller and easier to comprehend code sample.

Second, is it worthwhile to try and adopt a functional approach to parsing, or is it really in optimizing intermediate code that functional languages shine and I just haven't gotten there yet? That is, should I muddle through the parsing in F# using an imperative style and switch to a more functional approach later on?

asked Dec 11 '10 by Samsdram



2 Answers

Answer derived from this blog article:

So my question is what would a more traditional functional approach to parsing (i.e. few side effects) look like?

Sounds like you need to separate functional programming (as in Lisp, Scheme, Standard ML, CAML, OCaml, F#) from purity (the absence of side effects, as in Haskell) and from incidental language features (algebraic datatypes, pattern matching).

Thanks to algebraic datatypes, pattern matching and higher-order functions, F# is good for parsing and great for transformations and code generation, but most production parsers written in F# are not pure. Historically, the family of languages F# is mostly derived from (the MetaLanguages, or MLs) was bred specifically for this kind of metaprogramming.

Here is a very simple set of mutually-recursive active patterns that parse and evaluate mathematical expressions composed of single digits, + - * operators and bracketed subexpressions:

> let rec (|Term|_|) = function
    | Factor(e1, t) ->
        let rec aux e1 = function
          | '+'::Factor(e2, t) -> aux (e1 + e2) t
          | '-'::Factor(e2, t) -> aux (e1 - e2) t
          | t -> Some(e1, t)
        aux e1 t
    | _ -> None
  and (|Factor|_|) = function
    | '-'::Factor(e, t) -> Some(-e, t)
    | Atom(e1, '*'::Factor(e2, t)) -> Some(e1 * e2, t)
    | Atom(e, t) -> Some(e, t)
    | _ -> None
  and (|Atom|_|) = function
    | c::t when '0'<=c && c<='9' -> Some(int(string c), t)
    | '('::Term(e, ')'::t) -> Some(e, t)
    | _ -> None;;
val ( |Term|_| ) : char list -> (int * char list) option
val ( |Factor|_| ) : char list -> (int * char list) option
val ( |Atom|_| ) : char list -> (int * char list) option

Here is an example of it being used to parse and evaluate an expression:

> let (Term e) = List.ofSeq "1+2*(3-4)*-5";;
val e : int * char list = (11, [])

That's a pure solution using pattern matching over lists with F#'s active patterns. In reality, you'll want to define a type for your abstract syntax tree and return a value of that type. This is really easy in F#:

type expr =
  | Int of int
  | Neg of expr
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

  static member (~-) f = Neg f
  static member (+) (f, g) = Add(f, g)
  static member (-) (f, g) = Sub(f, g)
  static member (*) (f, g) = Mul(f, g)

let rec (|Term|_|) = function
  | Factor(e1, t) ->
      let rec aux e1 = function
        | '+'::Factor(e2, t) -> aux (e1 + e2) t
        | '-'::Factor(e2, t) -> aux (e1 - e2) t
        | t -> Some(e1, t)
      aux e1 t
  | _ -> None
and (|Factor|_|) = function
  | '-'::Factor(e, t) -> Some(-e, t)
  | Atom(e1, '*'::Factor(e2, t)) -> Some(e1 * e2, t)
  | Atom(e, t) -> Some(e, t)
  | _ -> None
and (|Atom|_|) = function
  | c::t when '0'<=c && c<='9' -> Some(Int(int(string c)), t)
  | '('::Term(e, ')'::t) -> Some(e, t)
  | _ -> None

let (Term e) = List.ofSeq "1+2*(3-4)*-5"

Note that only one minor change to the parser was required because the AST can also be constructed using the +, - and * operators.
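If you also want to evaluate the resulting tree, a small recursive function over the expr type above suffices (this evaluator is my own addition, not part of the original answer):

// A straightforward evaluator over the expr AST defined above.
let rec eval = function
    | Int n      -> n
    | Neg e      -> -(eval e)
    | Add (f, g) -> eval f + eval g
    | Sub (f, g) -> eval f - eval g
    | Mul (f, g) -> eval f * eval g

// For example, parsing and evaluating the earlier expression gives 11:
let (Term (tree, _)) = List.ofSeq "1+2*(3-4)*-5"
printfn "%d" (eval tree)   // prints 11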

Second, is it worthwhile to try and adopt a functional approach to parsing, or is it really in optimizing intermediate code that functional languages shine and I just haven't gotten there yet?

You're talking about purity, not functional programming. Purity is not particularly useful in the context of parsing text and, in fact, can be a real hindrance (e.g. interning symbols is a nightmare in Haskell). However, F# has many other benefits that make it good for this set of problems. In particular, although other languages like OCaml have much better tools for parsing, I think F# is the best .NET language in this context.

That is, should I muddle through the parsing in F# using an imperative style and switch to a more functional approach later on?

Depends entirely upon what you want to make functional. I'd use fslex and fsyacc with pure code to construct ASTs in the actions but impurities for anything like hash consing or generating unique IDs.
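For example, those two kinds of benign impurity might look like this (a minimal sketch of my own; genId and intern are illustrative names, not from any library):

open System.Collections.Generic

// A fresh-ID generator: each call returns the next integer.
let genId =
    let counter = ref 0
    fun () ->
        incr counter
        !counter

// Interning (a simple form of hash consing): structurally equal keys map to
// a single cached representative, so equal values can be shared.
let intern : string -> string =
    let table = Dictionary<string, string>()
    fun s ->
        match table.TryGetValue s with
        | true, cached -> cached
        | _ ->
            table.[s] <- s
            s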

You may appreciate the following articles I have written on this subject at this blog (note paywall):

  • "Parsing text with Lex and Yacc" (30th September 2007).
  • "Optimizing a simple bytecode interpreter" (31st October 2007).
  • "Parser combinators" (30th November 2007).
  • "Language-oriented programming: The Term-level Interpreter" (31st December 2007).
  • "Language-oriented programming: Term Rewriting" (16th August 2008).
  • "Run-time code generation using System.Reflection.Emit" (31st August 2008).
  • "Parsing and visualizing binary Geographic Information System data" (30th November 2009).
answered Oct 14 '22 by J D


One strategy for functional parsing is monadic parser combinators. You can read some about it here (and follow links) or use a library like FParsec. I do not recommend this approach if you're just learning/starting F#/compilers, though.
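To give a flavour of the combinator idea without pulling in a library, here is a minimal hand-rolled sketch (my own illustration; this is not FParsec's actual API):

// A parser takes a char list and either fails or returns a result plus the
// remaining input.
type Parser<'a> = char list -> ('a * char list) option

// Succeed without consuming input.
let ret x : Parser<'a> = fun input -> Some(x, input)

// Monadic bind: run p, then feed its result to f.
let bind (p : Parser<'a>) (f : 'a -> Parser<'b>) : Parser<'b> =
    fun input ->
        match p input with
        | Some(x, rest) -> f x rest
        | None -> None

// Try p; if it fails, try q on the same input.
let orElse (p : Parser<'a>) (q : Parser<'a>) : Parser<'a> =
    fun input ->
        match p input with
        | None -> q input
        | ok -> ok

// Parse one character satisfying a predicate.
let satisfy pred : Parser<char> = function
    | c :: rest when pred c -> Some(c, rest)
    | _ -> None

// Example: single digits separated by '+', summed as they are parsed.
let digit = bind (satisfy System.Char.IsDigit) (fun c -> ret (int c - int '0'))
let rec sum input =
    (orElse
        (bind digit (fun x ->
            bind (satisfy ((=) '+')) (fun _ ->
                bind sum (fun y -> ret (x + y)))))
        digit) input

printfn "%A" (sum (List.ofSeq "1+2+3"))   // Some (6, [])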

Another approach with F# is to use FsLex/FsYacc (in the PowerPack). I kinda loathe Lex/Yacc technology, so I also don't recommend this.

I think you should write a recursive descent parser by hand. I don't have strong feelings regarding a tokenizer, but simply tokenizing the entire file into an (immutable) list of tokens and then doing recursive descent (leveraging some pattern-matching) is a good way to deal with parsing. And of course, you'll want to use discriminated unions to represent the AST output of the parser (a la here).

I haven't read the dragon book in a long time, but I'm apparently the only person on the planet who doesn't like it. I would consider abandoning that text in favor of a book that discusses compilers using some ML-based language, though I can't recommend one offhand.

EDIT

I haven't done one of these in a while, so I took a few minutes to code a small sample.

// AST for tiny language
type Op =
    | Plus
    | Minus

type Expr =
    | Literal of int
    | BinaryOp of Expr * Op * Expr // left, op, right

type Stmt =
    | IfThenElse of Expr * Stmt * Stmt // cond, then, else; 0=false in cond
    | Print of Expr

// sample program
let input = @"
    if 1+1-1 then
        print 42
    else
        print 0"

// expected AST
let goal =
    IfThenElse(
        BinaryOp( BinaryOp(Literal(1),Plus,Literal(1)), Minus, Literal(1)),
        Print(Literal(42)),
        Print(Literal(0)))

////////////////////////////////////////////////////////////////////////////
// Lexer

type Token =
    | IF
    | THEN
    | ELSE
    | PRINT
    | NUM of int  // non-negative
    | PLUS
    | MINUS
    | EOF

let makeTokenizer (s:string) =
    let i = ref 0
    let keywords = [
        "if", IF
        "then", THEN
        "else", ELSE
        "print", PRINT
        "+", PLUS
        "-", MINUS ]
    let rec getNextToken() =
        if !i >= s.Length then
            EOF
        elif System.Char.IsWhiteSpace(s.[!i]) then
            incr i
            getNextToken()
        elif System.Char.IsDigit(s.[!i]) then
            let mutable j = !i
            while j < s.Length && System.Char.IsDigit(s.[j]) do
                j <- j + 1
            let numStr = s.Substring(!i, j - !i)
            i := j
            NUM(System.Int32.Parse(numStr)) // may throw, e.g. if > MAXINT
        else
            let keyword = keywords |> List.tryPick (fun (kwStr,kwTok) ->
                if s.IndexOf(kwStr, !i) = !i then
                    i := !i + kwStr.Length
                    Some(kwTok)
                else
                    None)
            match keyword with
            | Some k -> k
            | None ->
                failwithf "unexpected char '%c' at position %d" s.[!i] !i
    getNextToken

let tokens =
    let nextToken = makeTokenizer input
    let t = ref(nextToken())
    [
        yield !t
        while !t <> EOF do
            t := nextToken()
            yield !t
    ]

printfn "%A" tokens // sanity check our tokenizer works

/////////////////////////////////////////////////////////////////////////
// Parser

let parseExpr toks =
    match toks with
    | NUM x :: rest ->
        let mutable rest = rest
        let mutable expr = Literal x
        while rest.Head = PLUS || rest.Head = MINUS do
            let op,y,r =
                match rest with
                | PLUS::NUM y::t -> Plus, Literal y, t
                | MINUS::NUM y::t -> Minus, Literal y, t
                | _ ->
                    failwith "parse error in expression, expected number"
            expr <- BinaryOp(expr, op, y)
            rest <- r
        expr, rest
    | _ -> failwith "parse error in expression, expected number"

let rec parseStmt toks =
    match toks with
    | PRINT :: rest ->
        let e,rest = parseExpr(rest)
        Print(e), rest
    | IF :: rest ->
        let e,rest = parseExpr(rest)
        match rest with
        | THEN :: rest ->
            let s1,rest = parseStmt(rest)
            match rest with
            | ELSE :: rest ->
                let s2,rest = parseStmt(rest)
                IfThenElse(e,s1,s2), rest
            | _ ->
                failwith "parse error after if branch, expected 'else'"
        | _ ->
            failwith "parse error after if expression, expected 'then'"
    | _ -> failwith "parse error, expected statement"

let parseProgram toks =
    let s,rest = parseStmt toks
    match rest with
    | [EOF] -> s
    | _ ->
        failwith "parse error after statement, expected EOF"

let p = parseProgram tokens
printfn "%A" p
assert( p = goal )

(Hopefully there are no egregious bugs.)

answered Oct 14 '22 by Brian