Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How would you parse indentation (python style)?

How would you define your parser and lexer rules to parse a language that uses indentation for defining scope.

I have already googled and found a clever approach for parsing it by generating INDENT and DEDENT tokens in the lexer.

I will go deeper on this problem and post an answer if I come to something interesting, but I would like to see other approaches to the problem.

EDIT: As Charlie pointed out, there is already another thread very similar if not the same. Should my post be deleted?

like image 405
Null303 Avatar asked Dec 10 '08 16:12

Null303


People also ask

How do you read indentation in Python?

Indentation refers to the spaces at the beginning of a code line. Where in other programming languages the indentation in code is for readability only, the indentation in Python is very important. Python uses indentation to indicate a block of code.

How do I get out of indentation in Python?

Just select the text you want indented and hit Tab . To un-indent, select the text and hit Shift + Tab .

How do you indent for loops in Python?

The first line of the for loop must end with a colon, and the body must be indented. The colon at the end of the first line signals the start of a block of statements. Python uses indentation rather than {} or begin / end to show nesting. Any consistent indentation is legal, but almost everyone uses four spaces.


2 Answers

This is kind of hypothetical, as it would depend on what technology you have for your lexer and parser, but the easiest way would seem to be to have BEGINBLOCK and ENDBLOCK tokens analogous to braces in C. Using the "offsides rule" your lexer needs to keep track of a stack of indendtation levels. When the indent level increases, emit a BEGINBLOCK for the parser; when the indentation level decreases, emit ENDBLOCK and pop levels off the stack.

Here's another discussion of this on SO, btw.

like image 57
Charlie Martin Avatar answered Sep 21 '22 18:09

Charlie Martin


Also you can track somewhere in lexer how many ident items are preceding first line and pass it to parser. Most interesting part would be trying to pass it to parser correctly :) If your parser uses lookahead (here I mean parser may query for variable number of tokens before it really going to match even one) then trying to pass it through one global variable seems to be very bad idea (because lexer can slip on next line and change value of indent counter while parser is still trying to parse previous line). Also globals are evil in many other cases ;) Marking first line 'real' token in someway with indent counter is more reasonable. I can't give you exact example (I don't even know what parser and lexer generators are you going to use if any...) but something like storing data on first line tokens (it could be non comfortable if you can't easily get such token from parser) or saving custom data (map that links tokens to indent, array where every line in source code as index and indent value as element value) seems to be enough. One downside of this approach is additional complexity to parser that will need to distinguish between ident values and change its behavior based on it. Something like LOOKAHEAD({ yourConditionInJava }) for JavaCC may work here but it is NOT a very good idea. A lot of additional tokens in your approach seems to be less evil thing to use :)

As another alternative I would suggest is to mix this two approaches. You could generate additional tokens only when indent counter changes its value on next line. It is like artificial BEGIN and END token. In this way you may lower number of 'artificial' tokens in your stream fed into parser from lexer. Only your parser grammar should be adjusted to understand additional tokens...

I didn't tried this (have no real experience with such languages parsing), just sharing my thoughts about possible solutions. Checking already built parsers for this kinds of languages could be of great value for you. Open source is your friend ;)

like image 38
IgorK Avatar answered Sep 19 '22 18:09

IgorK