Can I parse an indentation-based language using Instaparse, or any other Clojure libraries?

Question

Can Instaparse or another Clojure library be used to parse an indentation-based language? I've seen examples of using Instaparse to parse grammars expressed in EBNF/ABNF. Is there a good way to use it to parse an indentation-aware language like Python?

rici · Accepted Answer

Apparently, you're not the first person to have this issue with Instaparse.

With most parser generators, you would solve this problem with a custom lexer, using some variation on the scheme proposed by @andrewcooke. However, Instaparse was designed to avoid the need for a lexer and consequently does not provide an interface which uses one.

This lack was specifically raised in issue 9, superseded by issue 10; in the latter, the Instaparse author suggests a workaround:

"In the meantime, there's a workaround you could potentially employ. You could map tokens like INDENT and DEDENT to unused characters and then rebuild it as a string, then run instaparse on that. I believe ASCII characters 0-8 and 11-31 are unused and could serve as tokens."

That's certainly a possibility, although it's an aesthetic judgement as to whether that's "doing something very hacky." Still, you could write such a hack in the hopes that it can be removed once issue 10 is resolved. You might want to join in the discussion of that issue.
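
As a rough illustration of that workaround, here is a minimal sketch of the preprocessing step only. It is written in Python to keep it short and is not Instaparse code, so the same logic would need porting to Clojure. The marker characters (\x01 and \x02), the function name mark_indentation, and the rule that only spaces count as indentation are all assumptions made for the example.

```python
# Hypothetical marker choice: any two characters the grammar never uses.
INDENT = "\x01"
DEDENT = "\x02"


def mark_indentation(source: str) -> str:
    """Rewrite leading spaces as explicit INDENT/DEDENT marker characters."""
    levels = [0]  # stack of indentation widths that are currently open
    out = []
    for line in source.splitlines():
        body = line.lstrip(" ")
        if not body:
            continue  # assumption: blank lines do not affect block structure
        width = len(line) - len(body)
        if width > levels[-1]:
            levels.append(width)
            out.append(INDENT)
        while width < levels[-1]:
            levels.pop()
            out.append(DEDENT)
        if width != levels[-1]:
            raise ValueError(f"dedent matches no enclosing level: {line!r}")
        out.append(body + "\n")
    # Close any blocks that are still open at end of input.
    out.extend(DEDENT for _ in levels[1:])
    return "".join(out)


print(repr(mark_indentation("a\n  b\n  c\nd\n")))
# prints 'a\n\x01b\nc\n\x02d\n'
```

The rewritten string could then be handed to a grammar that treats the two marker characters as ordinary literal terminals, which is the part Instaparse would handle.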

andrew cooke · Answer

typically, to do indentation-based parsing you need three things:

1. extend the tokenizer to make a token from the leading spaces on each line
2. process the stream of tokens, comparing each line's leading spaces against the current context and indicating whether there's an increase or decrease (so instead of a token at the start of every line, you have a token only when the indentation level changes)
3. write a "normal" parser that is aware of the tokens that indicate a change in indent level.

depending on the language, you might need to feed some information back from the third part to the second part.
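
the three parts can be sketched roughly as below. this is a toy in Python, not tied to any parser library: the token kinds (WS, LINE, INDENT, DEDENT), the function names, and the spaces-only indentation rule are all made up for the example, and the "parser" just builds nested lists. it also leaves out any feedback from the third part to the second.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); kinds used: WS, LINE, INDENT, DEDENT


def tokenize(source: str) -> Iterator[Token]:
    """Part 1: emit a WS token for the leading spaces of each non-blank line."""
    for line in source.splitlines():
        body = line.lstrip(" ")
        if body:
            yield ("WS", line[: len(line) - len(body)])
            yield ("LINE", body)


def track_indentation(tokens: Iterable[Token]) -> Iterator[Token]:
    """Part 2: replace per-line WS tokens with INDENT/DEDENT on level changes."""
    levels = [0]  # stack of indentation widths that are currently open
    for kind, text in tokens:
        if kind != "WS":
            yield (kind, text)
            continue
        width = len(text)
        if width > levels[-1]:
            levels.append(width)
            yield ("INDENT", "")
        while width < levels[-1]:
            levels.pop()
            yield ("DEDENT", "")
        if width != levels[-1]:
            raise ValueError("dedent does not match any enclosing level")
    for _ in levels[1:]:  # close blocks still open at end of input
        yield ("DEDENT", "")


def parse_block(tokens: Iterator[Token]) -> List:
    """Part 3: an ordinary recursive parser; INDENT opens a nested block."""
    block: List = []
    for kind, text in tokens:  # nested calls share and advance this iterator
        if kind == "INDENT":
            block.append(parse_block(tokens))
        elif kind == "DEDENT":
            return block
        else:
            block.append(text)
    return block


tree = parse_block(track_indentation(tokenize("if x:\n  a\n  b\nc\n")))
print(tree)  # prints ['if x:', ['a', 'b'], 'c']
```

keeping part 2 as its own pass over the token stream means the parser in part 3 never has to look at whitespace at all.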

i don't know anything about instaparse (the only reason i am answering is that people who ask "what have you tried so far?" on questions like this really piss me off) so you'd need to look at whether there is some way to place the second stage between the tokenizer and the parser (i scanned the docs and it doesn't seem to have anything that does the second part for you, but you could write that yourself). but it should be able to do the first and third parts ok.

Tags: parsing, clojure

Rob Lachlan
