An off-side language is the one where
...the scope of declarations (a block) in that language is expressed by their indentation.
Examples of such languages are Python, Boo, Nemerle, YAML and several more.
So my question is this: how do I actually parse these? How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)? Are parser generators of any help here or do I have to hand-code lexer/parser myself?
Python has a lexer that generates Indent and Dedent tokens, that are equivalent to curly braces ("{", "}"). There is even an example on Stack Overflow, with a simple implementation of such a lexer.
For tabs vs. spaces, Python only has a coding convention: Use 4 spaces per indentation level. Tabs are legal syntax though.
The easiest way to resolve the tabs versus spaces problem is to disallow combinations of spaces and tabs (this is what's done in F#, for instance). Any modern editor will allow tabs to be converted to some number of spaces.
As for whether you need to abandon parser generators, probably not, but you will have to hack the offsides identification in there somewhere. This may require a bit of creativity on your part. Based on browsing the F# source, it looks like they use a post-lexing step to create additional tokens representing offside language elements.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With