
Can I parse an indentation-based language using Instaparse, or any other Clojure library?

Can Instaparse or another Clojure library be used to parse an indentation-based language? I've seen examples of using Instaparse to parse grammars expressed in EBNF/ABNF. Is there a good way to use it to parse an indentation-aware language like Python?

Rob Lachlan asked May 27 '13 19:05


2 Answers

Apparently, you're not the first person to have this issue with Instaparse.

With most parser generators, you would solve this problem with a custom lexer, using some variation on the scheme proposed by @andrewcooke. However, Instaparse was designed to avoid the need for a lexer and consequently does not provide an interface which uses one.

This lack was specifically raised in issue 9, superseded by issue 10; in the latter, the Instaparse author suggests a workaround:

In the meantime, there's a workaround you could potentially employ. You could map tokens like INDENT and DEDENT to unused characters and then rebuild it as a string, then run instaparse on that. I believe ASCII characters 0-8 and 11-31 are unused and could serve as tokens.

That's certainly a possibility, although it's an aesthetic judgement as to whether that's "doing something very hacky." Still, you could write such a hack in the hopes that it can be removed once issue 10 is resolved. You might want to join in the discussion of that issue.
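As a rough illustration of the workaround quoted above, the preprocessing pass might look like the following Python sketch (the same logic should port directly to Clojure). The function name `mark_indentation` and the choice of `\x01` and `\x02` as markers are assumptions made for this example, not part of Instaparse. It also assumes indentation uses spaces only, and it skips blank lines.

```python
# Hypothetical marker characters; any control character that cannot
# occur in the source text would work equally well.
INDENT = "\x01"
DEDENT = "\x02"


def mark_indentation(source: str) -> str:
    """Strip leading spaces and insert INDENT/DEDENT markers where the level changes."""
    levels = [0]  # stack of indentation widths for the blocks currently open
    out = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not affect block structure
        width = len(line) - len(stripped)
        if width > levels[-1]:
            # deeper than the enclosing block: open a new one
            levels.append(width)
            out.append(INDENT)
        else:
            # shallower: close blocks until the widths match again
            while width < levels[-1]:
                levels.pop()
                out.append(DEDENT)
            if width != levels[-1]:
                raise ValueError(f"inconsistent dedent: {line!r}")
        out.append(stripped + "\n")
    # close any blocks still open at end of input
    out.extend(DEDENT for _ in levels[1:])
    return "".join(out)
```

The resulting string can then be handed to the parser, with the grammar treating the two marker characters as ordinary literal terminals.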

rici answered Nov 13 '22 14:11


typically to do indentation-based parsing you need three things:

  • extend the tokenizer to make a token from the leading spaces on each line

  • process the stream of tokens, comparing the leading spaces on each line against the current context and indicating whether there's an increase or decrease (so instead of a token at the start of every line, you have a token only when the indentation level changes)

  • write a "normal" parser that is aware of the tokens that indicate a change in indent level.
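the three stages listed above could be sketched roughly as follows, in Python rather than Clojure. All names here (`leading_space_tokens`, `indent_tokens`, `parse_block`, `parse`, and the token kinds `LINE`, `INDENT`, `DEDENT`) are hypothetical, and the toy "language" is just nested lines of text. It assumes space-only indentation and ignores blank lines.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); kind is "LINE", "INDENT" or "DEDENT"


def leading_space_tokens(source: str) -> Iterator[Tuple[int, str]]:
    """Stage 1: yield (indent width, text) for each non-blank line."""
    for line in source.splitlines():
        text = line.lstrip(" ")
        if text:
            yield len(line) - len(text), text


def indent_tokens(lines: Iterable[Tuple[int, str]]) -> Iterator[Token]:
    """Stage 2: emit INDENT/DEDENT only when the indentation level changes."""
    stack = [0]  # widths of the blocks currently open
    for width, text in lines:
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", "")
        if width != stack[-1]:
            raise SyntaxError(f"inconsistent dedent before {text!r}")
        yield ("LINE", text)
    for _ in stack[1:]:  # close blocks still open at end of input
        yield ("DEDENT", "")


def parse_block(tokens: List[Token], pos: int = 0):
    """Stage 3: block := (LINE [INDENT block DEDENT])*; returns (nodes, next pos)."""
    block = []
    while pos < len(tokens) and tokens[pos][0] == "LINE":
        node = {"text": tokens[pos][1], "children": []}
        pos += 1
        if pos < len(tokens) and tokens[pos][0] == "INDENT":
            node["children"], pos = parse_block(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos][0] != "DEDENT":
                raise SyntaxError("expected DEDENT")
            pos += 1
        block.append(node)
    return block, pos


def parse(source: str):
    """Run all three stages and return the nested tree."""
    tokens = list(indent_tokens(leading_space_tokens(source)))
    tree, pos = parse_block(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][0]}")
    return tree
```

keeping the second stage as its own generator means the parser in the third stage never sees whitespace at all, only the level-change tokens.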

depending on the language you might need to feed back some information from the third part to the second part.

i don't know anything about instaparse (the only reason i am answering is that people who ask "what have you tried so far?" on questions like this really piss me off) so you'd need to look at whether there is some way to place the second stage between the tokenizer and the parser (i scanned the docs and it doesn't seem to have anything that does the second part for you, but you could write that yourself). but it should be able to do the first and third parts ok.

andrew cooke answered Nov 13 '22 14:11