Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PEG for Python style indentation

How would you write a Parsing Expression Grammar in any of the following Parser Generators (PEG.js, Citrus, Treetop) which can handle Python/Haskell/CoffeScript style indentation:

Examples of a not-yet-existing programming language:

square x =     x * x 

cube x =     x * square x 

fib n =   if n <= 1     0   else     fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets 

Update: Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:

foo   bar = 1   baz = 2 tap   zap = 3  # should yield (ruby style hashmap): # {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } } 
like image 727
Matt Avatar asked Nov 17 '10 14:11

Matt


People also ask

What is peg in Python?

The Parser in CPython is currently a PEG (Parser Expression Grammar) parser. The first version of the parser used to be an LL(1) based parser that was one of the oldest parts of CPython implemented before it was replaced by PEP 617.

What kind of parser does Python use?

Making experiments. As the generated C parser is the one used by Python, this means that if something goes wrong when adding some new rules to the grammar you cannot correctly compile and execute Python anymore.

How do you write parser in Python?

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Python code.


2 Answers

So what we are really doing here with indentation is creating something like a C-style blocks which often have their own lexical scope. If I were writing a compiler for a language like that I think I would try and have the lexer keep track of the indentation. Every time the indentation increases it could insert a '{' token. Likewise every time it decreases it could inset an '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straight forward.

like image 23
Samsdram Avatar answered Oct 21 '22 23:10

Samsdram


Pure PEG cannot parse indentation.

But peg.js can.

I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.

For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js

/* Initializations */ {   function start(first, tail) {     var done = [first[1]];     for (var i = 0; i < tail.length; i++) {       done = done.concat(tail[i][1][0])       done.push(tail[i][1][1]);     }     return done;   }    var depths = [0];    function indent(s) {     var depth = s.length;      if (depth == depths[0]) return [];      if (depth > depths[0]) {       depths.unshift(depth);       return ["INDENT"];     }      var dents = [];     while (depth < depths[0]) {       depths.shift();       dents.push("DEDENT");     }      if (depth != depths[0]) dents.push("BADDENT");      return dents;   } }  /* The real grammar */ start   = first:line tail:(newline line)* newline? { return start(first, tail) } line    = depth:indent s:text                      { return [depth, s] } indent  = s:" "*                                   { return indent(s) } text    = c:[^\n]*                                 { return c.join("") } newline = "\n"                                     {} 

depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.

peg.js produces for the text:

alpha   beta   gamma     delta epsilon     zeta   eta theta   iota 

these results:

[    "alpha",    "INDENT",    "beta",    "gamma",    "INDENT",    "delta",    "DEDENT",    "DEDENT",    "epsilon",    "INDENT",    "zeta",    "DEDENT",    "BADDENT",    "eta",    "theta",    "INDENT",    "iota",    "DEDENT",    "",    "" ] 

This tokenizer even catches bad indents.

like image 195
nalply Avatar answered Oct 21 '22 22:10

nalply