
Parsing "off-side" (indentation-based) languages

An off-side language is one where

...the scope of declarations (a block) in that language is expressed by their indentation.

Examples of such languages are Python, Boo, Nemerle, YAML and several more.

So my question is this: how do I actually parse these? How do I resolve the tabs vs. spaces problem (are two tabs and 8 spaces equivalent or not)? Are parser generators of any help here, or do I have to hand-code the lexer/parser myself?

Anton Gogolev asked Feb 01 '10 16:02



2 Answers

Python's lexer generates INDENT and DEDENT tokens, which are equivalent to curly braces ("{" and "}"). There is even an example on Stack Overflow with a simple implementation of such a lexer.
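To make the INDENT/DEDENT idea concrete, here is a minimal sketch of such a lexer. It is not CPython's tokenizer: the name `indent_tokens` and the `LINE` token kind are made up for illustration, and it assumes indentation consists of spaces only. A stack holds the indentation widths of the open blocks; a deeper line pushes a width and emits INDENT, a shallower line pops widths and emits one DEDENT per closed block.

```python
def indent_tokens(source):
    """Yield (kind, value) pairs, wrapping indented blocks in INDENT/DEDENT.

    Illustrative only: assumes space-only indentation and treats the rest
    of each line as a single LINE token.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not affect block structure
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # Shallower: close every block that is deeper than this line.
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise SyntaxError(f"dedent to column {width} matches no outer block")
        yield ("LINE", stripped)
    while len(stack) > 1:
        # Close blocks that are still open at end of input.
        stack.pop()
        yield ("DEDENT", 0)


if __name__ == "__main__":
    for token in indent_tokens("a\n  b\n    c\nd\n"):
        print(token)
```

A parser built on top of this can then treat INDENT and DEDENT exactly like "{" and "}" in a brace-delimited grammar.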

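For tabs vs. spaces, Python's coding convention (PEP 8) is to use 4 spaces per indentation level. Tabs are legal syntax though: when measuring indentation, the tokenizer advances each tab to the next multiple of 8 columns, and Python 3 rejects indentation that mixes tabs and spaces inconsistently (TabError).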
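One way to make the question's "two tabs vs. 8 spaces" comparison concrete is to measure indentation in columns, with each tab advancing to the next tab stop. The sketch below assumes a tab size of 8, which is the value Python's language reference uses; the function name `indent_column` is made up for illustration.

```python
def indent_column(line, tabsize=8):
    """Return the column at which the first non-whitespace character starts.

    A space advances one column; a tab advances to the next multiple of
    `tabsize`. The default of 8 is an assumption borrowed from Python.
    """
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tabsize + 1) * tabsize
        else:
            break
    return col


if __name__ == "__main__":
    print(indent_column("\t\tx"))      # two tabs
    print(indent_column("        x"))  # eight spaces
    print(indent_column("\tx"))        # one tab
```

Under this rule one tab and 8 spaces land on the same column, while two tabs land on column 16, so "two tabs" and "8 spaces" are not equivalent.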

Eike answered Nov 05 '22 10:11



The easiest way to resolve the tabs versus spaces problem is to disallow combinations of spaces and tabs (this is what's done in F#, for instance). Any modern editor will allow tabs to be converted to some number of spaces.
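A rough sketch of the "disallow combinations" policy: scan each line's leading whitespace and reject the input if it contains both characters. The function name `check_indentation` is hypothetical, and this is not how F# implements its rule; it only illustrates the idea. A stricter variant could require the same indentation character across the whole file.

```python
def check_indentation(source):
    """Raise SyntaxError if any line's leading whitespace mixes tabs and spaces."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Slice off everything from the first non-whitespace character on.
        leading = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in leading and "\t" in leading:
            raise SyntaxError(f"line {lineno}: indentation mixes tabs and spaces")


if __name__ == "__main__":
    check_indentation("a\n\tb\n")  # tabs only: accepted
    check_indentation("a\n  b\n")  # spaces only: accepted
    try:
        check_indentation("a\n \tb\n")
    except SyntaxError as err:
        print(err)
```

Running this check before lexing means the lexer itself never has to decide how wide a tab is.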

As for whether you need to abandon parser generators: probably not, but you will have to handle the off-side rule somewhere in the pipeline, and this may require a bit of creativity on your part. Based on browsing the F# source, it looks like they use a post-lexing step to create additional tokens representing off-side language elements.
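A post-lexing step of this kind might look roughly like the sketch below. It is not F#'s actual algorithm. It assumes a generated lexer that already attaches a line number and a 0-based column to each token; the `Token` shape and the name `add_layout_tokens` are hypothetical. The filter looks at the column of the first token on each line and injects INDENT/DEDENT tokens, so the generated parser only ever sees an ordinary, explicitly delimited token stream.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical token shape; a real lexer's tokens will differ."""
    kind: str
    text: str
    line: int
    col: int  # 0-based column of the token's first character


def add_layout_tokens(tokens: Iterable[Token]) -> Iterator[Token]:
    """Insert INDENT/DEDENT tokens based on the first token of each line."""
    stack = [0]        # columns of the currently open blocks
    prev_line = None   # line number of the previously seen token
    for tok in tokens:
        if tok.line != prev_line:
            # First token on a new line: compare its column to the open block.
            if tok.col > stack[-1]:
                stack.append(tok.col)
                yield Token("INDENT", "", tok.line, tok.col)
            while tok.col < stack[-1]:
                stack.pop()
                yield Token("DEDENT", "", tok.line, tok.col)
            if tok.col != stack[-1]:
                raise SyntaxError(f"line {tok.line}: inconsistent dedent")
            prev_line = tok.line
        yield tok
    end_line = prev_line if prev_line is not None else 0
    for _ in stack[1:]:
        # Close blocks that are still open at end of input.
        yield Token("DEDENT", "", end_line, 0)


if __name__ == "__main__":
    stream = [
        Token("ID", "if", 1, 0), Token("ID", "x", 1, 3),
        Token("ID", "y", 2, 4),
        Token("ID", "z", 3, 0),
    ]
    for tok in add_layout_tokens(stream):
        print(tok)
```

The grammar given to the parser generator can then list INDENT and DEDENT as plain terminals, the same way it would list braces.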

kvb answered Nov 05 '22 09:11
