PEG for Python style indentation

How would you write a Parsing Expression Grammar in any of the following parser generators (PEG.js, Citrus, Treetop) that can handle Python/Haskell/CoffeeScript-style indentation?

Examples of a not-yet-existing programming language:

    square x =
        x * x

    cube x =
        x * square x

    fib n =
      if n <= 1
        0
      else
        fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets

Update: Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:

    foo
      bar = 1
      baz = 2
    tap
      zap = 3

    # should yield (ruby style hashmap):
    # {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }

asked Nov 17 '10 14:11

Matt

2 Answers

What we are really doing here with indentation is creating something like C-style blocks, which often have their own lexical scope. If I were writing a compiler for a language like that, I think I would have the lexer keep track of the indentation. Every time the indentation increases, it could insert a '{' token; likewise, every time it decreases, it could insert a '}' token. Writing an expression grammar with explicit curly braces to represent lexical scope then becomes more straightforward.
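A minimal sketch of that idea in plain JavaScript, not tied to any particular parser generator. The function name insertBraces is hypothetical, and the sketch assumes that indentation uses spaces only, that blank lines can be ignored, and that each line can be treated as a single token:

```javascript
// Hypothetical pre-lexing pass: turns leading-space indentation into "{" / "}" tokens.
function insertBraces(source) {
  var depths = [0];   // stack of open indentation widths, innermost last
  var tokens = [];

  source.split("\n").forEach(function (line) {
    if (line.trim() === "") return;           // assumption: blank lines are ignored

    var depth = line.match(/^ */)[0].length;  // number of leading spaces

    if (depth > depths[depths.length - 1]) {
      depths.push(depth);                     // deeper than before: open a block
      tokens.push("{");
    }
    while (depth < depths[depths.length - 1]) {
      depths.pop();                           // shallower: close one block per level
      tokens.push("}");
    }
    if (depth !== depths[depths.length - 1]) {
      throw new Error("inconsistent indentation: " + JSON.stringify(line));
    }
    tokens.push(line.trim());
  });

  // Close any blocks that are still open at the end of the input.
  while (depths.length > 1) {
    depths.pop();
    tokens.push("}");
  }
  return tokens;
}

console.log(insertBraces("foo\n  bar = 1\n  baz = 2\ntap\n  zap = 3"));
```

The resulting token list can then be fed to a grammar that only knows about explicit braces, which keeps the indentation bookkeeping out of the grammar itself.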

answered Oct 21 '22 23:10

Samsdram

Pure PEG cannot parse indentation.

But peg.js can.

I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.

For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js

    /* Initializations */
    {
      function start(first, tail) {
        var done = [first[1]];
        for (var i = 0; i < tail.length; i++) {
          done = done.concat(tail[i][1][0]);
          done.push(tail[i][1][1]);
        }
        return done;
      }

      var depths = [0];

      function indent(s) {
        var depth = s.length;

        if (depth == depths[0]) return [];

        if (depth > depths[0]) {
          depths.unshift(depth);
          return ["INDENT"];
        }

        var dents = [];
        while (depth < depths[0]) {
          depths.shift();
          dents.push("DEDENT");
        }

        if (depth != depths[0]) dents.push("BADDENT");

        return dents;
      }
    }

    /* The real grammar */
    start   = first:line tail:(newline line)* newline? { return start(first, tail) }
    line    = depth:indent s:text                      { return [depth, s] }
    indent  = s:" "*                                   { return indent(s) }
    text    = c:[^\n]*                                 { return c.join("") }
    newline = "\n"                                     {}

depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.

peg.js produces for the text:

    alpha
      beta
      gamma
        delta
    epsilon
        zeta
      eta
    theta
      iota

these results:

    [
       "alpha",
       "INDENT",
       "beta",
       "gamma",
       "INDENT",
       "delta",
       "DEDENT",
       "DEDENT",
       "epsilon",
       "INDENT",
       "zeta",
       "DEDENT",
       "BADDENT",
       "eta",
       "theta",
       "INDENT",
       "iota",
       "DEDENT",
       "",
       ""
    ]

This tokenizer even catches bad indents.
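To get from such a flat token stream to a tree, the INDENT and DEDENT markers can be folded with a stack. The following is only a sketch: the function name nest and the { text, children } node shape are assumptions for illustration, empty strings are skipped, and a line whose text is literally "INDENT" or "DEDENT" would be misread as a marker.

```javascript
// Hypothetical helper: folds ["a", "INDENT", "b", "DEDENT", ...] into nested nodes.
function nest(tokens) {
  var root = { text: null, children: [] };
  var stack = [root];   // enclosing nodes, innermost last
  var last = root;      // most recently added line node

  tokens.forEach(function (tok) {
    if (tok === "INDENT") {
      stack.push(last);                  // following lines become children of the previous line
    } else if (tok === "DEDENT") {
      if (stack.length > 1) stack.pop(); // return to the enclosing block
    } else if (tok === "BADDENT") {
      throw new Error("inconsistent indentation");
    } else if (tok !== "") {             // assumption: empty lines carry no content
      last = { text: tok, children: [] };
      stack[stack.length - 1].children.push(last);
    }
  });

  return root.children;
}

var tree = nest(["foo", "INDENT", "bar = 1", "baz = 2", "DEDENT",
                 "tap", "INDENT", "zap = 3", "DEDENT", ""]);
console.log(JSON.stringify(tree, null, 2));
```

As written, the tokenizer appears to emit DEDENT only when a later line is less indented, so input that ends inside an indented block leaves that block open. The sketch tolerates this because it simply returns whatever has been collected under the root.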

answered Oct 21 '22 22:10

nalply

Tags: syntax, language-design, parsing, peg, treetop
