I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule.
In short: the off-side rule means, in this context, that indentation is recognized as a syntactic element.
Here is the off-side rule in pseudocode, for building tokenizers that capture indentation in a usable form. I don't want to limit answers to any particular language:
token NEWLINE
    matches r"\n\ *"
    increase the line count
    pick up and store the indentation level
    also record the current level of parenthesis

procedure layout tokens
    level = stack of indentation levels
    push 0 to level
    last_newline = none
    for each token
        if it is NEWLINE, put it into last_newline and get the next token
        if last_newline contains something
            extract new_level and parenthesis_count from last_newline
            - if the newline was inside parentheses, do nothing
            - if new_level > level.top
                push new_level to level
                emit last_newline as INDENT token and clear last_newline
            - if new_level == level.top
                emit last_newline and clear last_newline
            - otherwise
                while new_level < level.top
                    pop from level
                    if new_level > level.top
                        freak out, the indentation is broken
                    emit last_newline as DEDENT token
                clear last_newline
        emit the token
    while level.top != 0
        emit a DEDENT token
        pop from level
Comments are stripped before tokens reach the layouter; the layouter sits between the lexer and the parser.
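For concreteness, here's a rough Python sketch of the scheme above. The token representation is a hypothetical `(kind, value)` pair of my own invention; I'm assuming the lexer stores the indentation level and parenthesis depth in each NEWLINE token's value, as your pseudocode suggests.

```python
def layout(tokens):
    """Fold NEWLINE tokens into NEWLINE/INDENT/DEDENT tokens.

    tokens is an iterable of (kind, value) pairs; a NEWLINE token's
    value is a (indent_level, paren_depth) pair recorded by the lexer.
    """
    levels = [0]            # stack of indentation levels
    last_newline = None
    for kind, value in tokens:
        if kind == "NEWLINE":
            last_newline = value        # keep only the most recent newline
            continue
        if last_newline is not None:
            new_level, paren_depth = last_newline
            if paren_depth > 0:
                pass                    # newline inside parentheses: swallow it
            elif new_level > levels[-1]:
                levels.append(new_level)
                yield ("INDENT", new_level)
            elif new_level == levels[-1]:
                yield ("NEWLINE", new_level)
            else:
                while new_level < levels[-1]:
                    levels.pop()
                    yield ("DEDENT", levels[-1])
                if new_level != levels[-1]:
                    raise SyntaxError("inconsistent dedent")
            last_newline = None
        yield (kind, value)
    while levels[-1] != 0:              # close any blocks still open at EOF
        levels.pop()
        yield ("DEDENT", levels[-1])
```

As in the pseudocode, no NEWLINE is emitted in the INDENT and DEDENT cases; emitting one after the DEDENTs, as discussed below, would be a one-line change in the `else` branch.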
This layouter doesn't generate more than one NEWLINE at a time, and doesn't generate a NEWLINE when an INDENT is coming up, so the parsing rules remain quite simple. I think it's pretty good, but let me know if there's a better way of accomplishing it.
After using this for a while, I've noticed that it can be nice to emit a NEWLINE after DEDENTs anyway: that way you can separate expressions with NEWLINE while keeping INDENT and DEDENT as a trailer for the expression.
I've written tokenizers and parsers for a couple of little indentation-centric domain-specific languages in the past couple of years, and what you have there looks pretty reasonable to me, for whatever that's worth. If I'm not mistaken, your method is quite similar to what Python does, for example, which seems like it ought to carry some weight.
Converting NEWLINE NEWLINE INDENT to just INDENT before it hits the parser definitely seems like the right way to do things -- it's a pain (IME) to always be peeking ahead for that in the parser! I've actually done that step as a separate layer in what ended up being a three-step process: the first combined what your lexer and layouter do, minus all the NEWLINE lookahead stuff (which made it very simple); the second (also very simple) layer folded consecutive NEWLINEs and converted NEWLINE INDENT to just INDENT (or, actually, COLON NEWLINE INDENT to INDENT, since in that case all indented blocks were preceded by colons); and the parser was the third stage on top of that. But it also makes a lot of sense to me to do things the way you've described, especially if you want to separate the lexer from the layouter -- which you'd presumably want to do if, for instance, you were using a code-generation tool to make your lexer, as is common practice.
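That folding layer is tiny; a sketch might look like this (the `(kind, value)` token pairs are a hypothetical representation, and I've left out the COLON special case):

```python
def fold_newlines(tokens):
    """Collapse runs of NEWLINE and drop a NEWLINE that precedes INDENT."""
    pending_newline = None
    for token in tokens:
        kind = token[0]
        if kind == "NEWLINE":
            pending_newline = token     # hold it until we see what follows
            continue
        if pending_newline is not None:
            if kind != "INDENT":        # NEWLINE INDENT -> just INDENT
                yield pending_newline
            pending_newline = None
        yield token
    if pending_newline is not None:     # trailing newline at EOF
        yield pending_newline
```

So `NEWLINE NEWLINE INDENT NAME` comes out as `INDENT NAME`, while `NAME NEWLINE NAME` passes through unchanged.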
I did have one application that needed to be a bit more flexible about indentation rules, essentially leaving the parser to enforce them when needed -- the following needed to be valid in certain contexts, for instance:
this line introduces an indented block of literal text:
    this line of the block is indented four spaces
  but this line is only indented two spaces
which doesn't work terribly well with INDENT/DEDENT tokens, since you end up needing to generate one INDENT for each column of indentation and an equal number of DEDENTs on the way back, unless you look way ahead to figure out where the indent levels are going to end up being, which it doesn't seem like you'd want a tokenizer to do. In that case I tried a few different things and ended up just storing a counter in each NEWLINE token that gave the change in indentation (positive or negative) for the following logical line. (Each token also stored all trailing whitespace, in case it needed preserving; for NEWLINE, the stored whitespace included the EOL itself, any intervening blank lines, and the indentation on the following logical line.) No separate INDENT or DEDENT tokens at all. Getting the parser to deal with that was a bit more work than just nesting INDENTs and DEDENTs, and might well have been hell with a complicated grammar that needed a fancy parser generator, but it wasn't nearly as bad as I'd feared, either. Again, no need for the parser to look ahead from NEWLINE to see if there's an INDENT coming up in this scheme.
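A much simplified sketch of that delta-carrying NEWLINE idea, ignoring the trailing-whitespace preservation mentioned above (and tabs, blank lines, and so on); the token shapes here are hypothetical:

```python
def tokenize_lines(text):
    """Emit a TEXT token per line, plus NEWLINE tokens whose value is the
    signed change in indentation (in columns) of the following line."""
    indent = 0
    lines = text.splitlines()
    for i, line in enumerate(lines):
        yield ("TEXT", line.lstrip(" "))
        if i + 1 < len(lines):
            nxt = lines[i + 1]
            next_indent = len(nxt) - len(nxt.lstrip(" "))
            yield ("NEWLINE", next_indent - indent)   # signed delta
            indent = next_indent
```

With the literal-text example above, the four-space line produces `("NEWLINE", 4)` and the two-space line `("NEWLINE", -2)`, and the parser decides what those deltas mean in context instead of matching nested INDENT/DEDENT pairs.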
Still, I think you'd agree that allowing and preserving all manner of crazy-looking whitespace in the tokenizer/layouter and letting the parser decide what's a literal and what's code is a bit of an unusual requirement! You certainly wouldn't want your parser to be saddled with that indentation counter if you just wanted to be able to parse Python code, for example. The way you're doing things is almost certainly the right approach for your application and many others besides. Though if anyone else has thoughts on how best to do this sort of thing, I'd obviously love to hear them....
I've been experimenting with this recently, and I came to the conclusion that, for my needs at least, I wanted the NEWLINEs to mark the end of each "statement", whether it was the last statement in an indented block or not; i.e. I need the newlines even before DEDENT.
My solution was to turn it on its head, and instead of NEWLINES marking the end of lines, I use a LINE token to mark the start of a line.
I have a lexer that collapses empty lines (including comment-only lines) and emits a single LINE token with information about the indentation of the last line. Then my preprocessing function takes this token stream and adds INDENT or DEDENT "in between" any lines where the indentation changes. So
line1
  line2
  line3
line4
would give the token stream
LINE "line1" INDENT LINE "line2" LINE "line3" DEDENT LINE "line4" EOF
This allows me to write clear grammar productions for statements without worrying about detecting the end of statements even when they end with nested, indented, subblocks, something that can be hard if you are matching NEWLINES (and DEDENTS) instead.
Here is the core of the preprocessor, written in OCaml:
match next_token () with
  LINE indentation ->
    if indentation > !current_indentation then
      (
        Stack.push !current_indentation indentation_stack;
        current_indentation := indentation;
        INDENT
      )
    else if indentation < !current_indentation then
      (
        let prev = Stack.pop indentation_stack in
        if indentation > prev then
          (
            current_indentation := indentation;
            BAD_DEDENT
          )
        else
          (
            current_indentation := prev;
            DEDENT
          )
      )
    else (* indentation = !current_indentation *)
      let token = remove_next_token () in
      if next_token () = EOF then
        remove_next_token ()
      else
        token
| _ ->
    remove_next_token ()
I haven't added support for parentheses yet, but that should be a simple extension. It does, however, avoid emitting a stray LINE at the end of the file.