Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to implement lexical analyser and parser in Haskell

Tags:

haskell

I got this piece of code here, it is a program written in an imperative programming language structured in Haskell, so the question is "how can I implement a lexer analyser and parser for this language", the program is defined to be a sequence of statements of which there are 6 types: ":=", "goto", "write", "stop", "if goto" and "int"

  1. int n=5
  2. write n
  3. int fac=1
  4. if0 n goto 8
  5. fac := fac * n
  6. n := n-1
  7. goto 4
  8. write fac
  9. stop

I'm kinda lost here, I've read about lexers and parsers, but did not find any example how to implement them,I will appreciate if you could give me piece of code, so that I could try to do it myself, or at least links with useful info

like image 415
John Avatar asked Nov 29 '22 08:11

John


1 Answers

Getting Started Parsing with Parsec

I'll not write the whole thing for you, but I'll get you started with each bit. The three stages we'll go through are:

  1. Define the grammar
  2. Make an Abstract Syntax Tree. (This will look like the grammar, so will be quite easy.)
  3. Write a parser. (This will also look like the grammar, so will be quite easy.)

We could make a separate lexing stage between 2 and 3, but Parsec is happy to do both levels. Lexing is where you split the input into tokens - bits of the input that make sense - the equivalent of words in human language, which are also known as lexemes. Skipping a separate lexing phase means we need to be a little more explicit about whitespace etc.

1. Grammar

First you need to define the grammar. Best done with paper and pencil, but I'll get you started:

  program ::= line {[newline line]}
  line ::= num dot whitespace statement
  statement ::= declaration | write | ifstatement | goto | assignment | stop
  declaration = "Int" whitespace varname equals int
  varname = letter {[alphanum]}
  -- more things here, including the more interesting ifstatement:
  ifstatement = "if" whitespace expression whitespace equals expression whitespace statement


  -- lower level stuff:
  dot = "."
  int = digit {[digit]}
  digit = "0" | "1" | ...  | "9"
  whitespace = " " okspace | "\t" okspace
  okspace = "" | whitespace

Have a think about how that matches your sample program, and think about how you'll finish it off:

1. Int n=5
2. write n
3. Int fac=1
4. if 0 n goto 8          -- unusual
5. fac := fac * n
6. n := n+1               -- complex
7. goto 4
8. write fac
9. stop

The if statement in line 4 is unusual because there's no = or == in it. Perhaps that's to simplify the grammar and it can only accept single variables or integers, with a space between. Perhaps that's a typo and you mean to have an equals sign and arbitrary expressions. Find out which and rewrite the ifstatement part of the grammar.

The assignment in line 6 is complex, because here you have to parse arbitrary arithmetic expressions. As far as I recall there are loads of examples of that knocking about, so I'll gladly skip that for now. Make it another question if you get stuck with that bit, but hopefully you'll have built up your parsing skills with the rest of it first.

2. Abstract Syntax Tree (AST)

An Abstract Syntax Tree represents the combination of tokens that make up your input. In Haskell we can define our own to suit the context, which will make life much simpler

I'm actually compiling this answer (a good way of checking for typos etc), which is why I need some declarations at the top of the code:

module ParseLang where

import Text.Parsec hiding (Line)
import Text.Parsec.String
import Control.Applicative hiding ((<|>), many)

We'll just make a Program a list of Lines, but enforce with the parser that there has to be at least one.

type Program = [Line]

For a Line, it needs a number, and a statement, but the dot is just syntax we don't need to store. We can store the line number as an Int, so although that allows negative numbers in the type declaration, again the parser won't accept negative.

data Line = Line {aNum::Int, aStatement :: Statement} 
   deriving Show

Multiple options are easy to define:

data Statement = Declaration VarName Int
               | Write VarName
               | IfStatement Expression Expression Statement
               | Goto LineNumber
               | Assignment VarName Expression
               | Stop
   deriving Show

Notice the absence of all the syntax cruft/connectives/equal signs leaving just the bits that change.

I'm stopping there - you can finish:

data Expression = Expression -- I've left this one for you
   deriving Show

type VarName = String   -- could use newtype for type safety for these to
type LineNumber = Int   

The lower level syntax doesn't need representing in the AST because we'll use strings for it.

3. The Parser

This bit is now nice and easy. Let's start at the bottom of the syntax tree and work up.

num :: Parser Int
num = read <$> many digit

We've used <$>, which is a synonym for fmap that we got by importing Control.Applicative. Here it alters the values returned by the parser using the pure function on the left, in this case, read. Have a look at this other answer for an introduction to fmap if you're not used to it.

Let's make a parser that parses a string literal then some whitespace:

whitespace = space >> spaces  -- a space then optional spaces

lit :: String -> Parser String
lit xs = string xs <* whitespace

Now that <* is interesting. It looks like <*>, which in effect combines two parsers, and is used in conjunction with <$> which in effect maps a pure function onto the result. *> and <* combine two parsers, but ignore the output of one of them, so string "goto" <* whitespace parses a "goto" and some whitespace, but throws away the whitespace.

Now we're ready to parse a goto statement:

goto :: Parser Statement
goto = Goto <$> (lit "goto" *> num)

Now let's have a go at varName

varName :: Parser VarName
varName = (:) <$> letter <*> many (alphaNum <|> oneOf "'_")

A couple of things are going on there.

1. <|> is alternative choice - one or the other, so (alphaNum <|> oneOf "'_") accepts an alphanumeric character or one of that pair of innocent characters ' and _ you might want to include in a variable name.

2. f <$> parser1 <*> parser2 is a really nice way of combining parsers. It runs parser1, then parser2, then maps the function f over the results f that they produced. It works with many parsers:

--ifstatement = "if" whitespace expression whitespace equals whitespace expression whitespace statement
ifStatement :: Parser Statement
ifstatement = IfStatement 
          <$> (lit "if" >> expression) 
          <*> (lit "=" >> expression)    -- I put = in, but see below
          <*> (whitespace >> statement)  -- I'd be happier with a then in here

If you only allow a VarName or Int instead of a general Expression, then you don't need the equals sign.

Here's how you put it together:

statement :: Parser Statement
statement = goto
        <|> stop
        <|> declaration 
        <|> write
        <|> ifStatement
        <|> assignment


--program ::= line {[newline line]}
program :: Parser Program
program = many1 line


--line ::= num dot whitespace statement
line :: Parser Line
line = Line <$> (num <* char '.') <*> (statement <* char '\n')

But I'll leave you with error messages every time you try to use a parser you've not finished yet, so the whole thing will compile OK and the bits you've defined should work.

stop =        error "You've not yet defined stop"
declaration = error "You've not yet defined declaration"
write =       error "You've not yet defined write"
ifStatement = error "You've not yet defined ifStatement"
assignment =  error "You've not yet defined assignment"
expression =  error "You've not yet defined expression"
like image 152
AndrewC Avatar answered Dec 09 '22 15:12

AndrewC