How would you write a Parsing Expression Grammar in any of the following Parser Generators (PEG.js, Citrus, Treetop) which can handle Python/Haskell/CoffeScript style indentation:
Examples of a not-yet-existing programming language:
square x = x * x
cube x = x * square x
fib n = if n <= 1 0 else fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
Update: Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:
foo bar = 1 baz = 2 tap zap = 3 # should yield (ruby style hashmap): # {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
The Parser in CPython is currently a PEG (Parser Expression Grammar) parser. The first version of the parser used to be an LL(1) based parser that was one of the oldest parts of CPython implemented before it was replaced by PEP 617.
Making experiments. As the generated C parser is the one used by Python, this means that if something goes wrong when adding some new rules to the grammar you cannot correctly compile and execute Python anymore.
The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Python code.
So what we are really doing here with indentation is creating something like a C-style blocks which often have their own lexical scope. If I were writing a compiler for a language like that I think I would try and have the lexer keep track of the indentation. Every time the indentation increases it could insert a '{' token. Likewise every time it decreases it could inset an '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straight forward.
Pure PEG cannot parse indentation.
But peg.js can.
I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.
For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js
/* Initializations */ { function start(first, tail) { var done = [first[1]]; for (var i = 0; i < tail.length; i++) { done = done.concat(tail[i][1][0]) done.push(tail[i][1][1]); } return done; } var depths = [0]; function indent(s) { var depth = s.length; if (depth == depths[0]) return []; if (depth > depths[0]) { depths.unshift(depth); return ["INDENT"]; } var dents = []; while (depth < depths[0]) { depths.shift(); dents.push("DEDENT"); } if (depth != depths[0]) dents.push("BADDENT"); return dents; } } /* The real grammar */ start = first:line tail:(newline line)* newline? { return start(first, tail) } line = depth:indent s:text { return [depth, s] } indent = s:" "* { return indent(s) } text = c:[^\n]* { return c.join("") } newline = "\n" {}
depths
is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.
peg.js produces for the text:
alpha beta gamma delta epsilon zeta eta theta iota
these results:
[ "alpha", "INDENT", "beta", "gamma", "INDENT", "delta", "DEDENT", "DEDENT", "epsilon", "INDENT", "zeta", "DEDENT", "BADDENT", "eta", "theta", "INDENT", "iota", "DEDENT", "", "" ]
This tokenizer even catches bad indents.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With