
How do Haskell compilers implement the parse-error(t) rule in practice?

The Haskell Report includes a somewhat notorious clause in the layout rules called "parse-error(t)". The purpose of this rule is to avoid forcing the programmer to write braces in single-line let expressions and similar situations. The relevant sentence is:

The side condition parse-error(t) is to be interpreted as follows: if the tokens generated so far by L together with the next token t represent an invalid prefix of the Haskell grammar, and the tokens generated so far by L followed by the token “}” represent a valid prefix of the Haskell grammar, then parse-error(t) is true.

This creates an unusual dependency where the lexer necessarily both produces tokens for the parser and responds to errors produced in the parser by inserting additional tokens for the parser to consume. This is unlike pretty much anything you'll find in any other language definition, and severely complicates the implementation if it is interpreted 100% literally.
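To make the side condition concrete, here is a minimal sketch of parse-error(t) read literally. It assumes a toy grammar with exactly one sentence, so "valid prefix" reduces to a list-prefix test. The names `sentence`, `validPrefix` and `parseError` are hypothetical and are not taken from any compiler; a real implementation would have to ask the parser itself whether a token sequence is a valid prefix.

```haskell
import Data.List (isPrefixOf)

-- Tokens are plain strings in this toy model.
type Token = String

-- The only sentence of a toy grammar (an assumption made for illustration).
sentence :: [Token]
sentence = ["let", "{", "x", "=", "42", "}", "in", "x"]

-- Oracle: is this token sequence a valid prefix of the toy grammar?
-- In a real compiler this question can only be answered by the parser.
validPrefix :: [Token] -> Bool
validPrefix = (`isPrefixOf` sentence)

-- parse-error(t), following the wording of the Report: the tokens so far
-- plus t are an invalid prefix, but the tokens so far plus "}" are valid.
parseError :: [Token] -> Token -> Bool
parseError soFar t =
  not (validPrefix (soFar ++ [t])) && validPrefix (soFar ++ ["}"])
```

With the tokens `let { x = 42` already emitted, `parseError` holds for the next token `in`, so the layout algorithm would insert the implicit close brace first. The awkward part is visible in the types: the lexer-side function needs `validPrefix`, which is parser knowledge.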

Unsurprisingly, no Haskell compiler that I'm aware of implements the entire rule as written. For example, GHC fails to parse the following expression, which is legal according to the report:

let x = 42 in x == 42 == True

There are a wide variety of other similar strange cases. This post has a list of some especially difficult examples. GHC handles some of these correctly, but it also (as of 7.10.1) fails on this one:

e = case 1 of 1 -> 1 :: Int + 1

Also, it seems GHC has an undocumented language extension called AlternativeLayoutRule that replaces the parse-error(t) clause with a stack of token contexts in the lexer that gives similar results in most cases; however, this is not the default behavior.

What do real-world Haskell compilers (including GHC in particular) do to approximate the parse-error(t) rule during lexing? I'm curious because I'm trying to implement a simple Haskell compiler and this rule is really tripping me up. (See also this related question.)

Aaron Rotenberg asked Sep 02 '15 04:09


1 Answer

I don't think the parse-error(t) rule is meant to be hard to implement. Yes, it does require the parser to communicate back to the lexer, but other than that it was probably designed to be relatively easy to implement with the dominant parsing technology of the time: an LALR(1)-based generated parser with some small support for error correction, like GNU Bison, or indeed like Happy, which GHC uses.

It might be ironic that, at least partially due to Haskell's success at enabling parser combinator libraries, that old technology is not as dominant as it used to be, at least in the Haskell community.

An LALR(1) (or LR(1)) generated parser has the following features that fit rather well with how the parse-error(t) rule is intended to be interpreted:

  • It never backtracks.
  • Its table-driven decisions mean that it always "knows" whether a given token is legal in the current spot, and if so, what to do with it.

Happy has a special error token that can be used to trigger an action when the current lexical token is not legal. Given this, the most relevant definition in GHC's Happy grammar is

close :: { () } 
        : vccurly               { () } -- context popped in lexer. 
        | error                 {% popContext } 

vccurly ("virtual close curly") is the token the lexer sends when it chooses by itself to close a layout level. popContext is an action defined in the lexer source that removes a layout level from the layout stack. (Note BTW that in this implementation, the error case does not need the lexer to send a vccurly token back).

Using this, all the GHC parser rules otherwise have to do is use close as the nonterminal that ends an indentation block opened with vocurly. Assuming the rest of the grammar is correct, this implements the rule correctly too.
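The two alternatives of close can be modelled outside of Happy as well. The following is a hand-rolled sketch, not GHC's code: the `Token` and `PState` types and the `parseLet` function are hypothetical, and the layout stack is assumed to already hold the context that the lexer pushed when it emitted the virtual open curly.

```haskell
-- Toy token type; VOCurly and VCCurly stand in for the virtual braces.
data Token = VOCurly | VCCurly | TLet | TIn | TIdent String | TEq | TNum Int
  deriving (Eq, Show)

-- Remaining input plus the lexer's layout stack (indentation columns).
data PState = PState { remaining :: [Token], contexts :: [Int] }
  deriving (Eq, Show)

-- Model of the close nonterminal.
close :: PState -> Maybe PState
-- Alternative 1: the lexer sent a virtual close curly and has already
-- popped its own context, so only the token is consumed.
close (PState (VCCurly : ts) ctx) = Just (PState ts ctx)
-- Alternative 2 (the error case): the next token is not legal here, so
-- pop one layout context and consume nothing.
close (PState ts (_ : ctx))       = Just (PState ts ctx)
-- No open layout context left: a genuine parse error.
close _                           = Nothing

-- A single hard-wired production: let { ident = num close in ident
parseLet :: PState -> Maybe PState
parseLet (PState (TLet : VOCurly : TIdent _ : TEq : TNum _ : ts) ctx) =
  case close (PState ts ctx) of
    Just (PState (TIn : TIdent _ : rest) ctx') -> Just (PState rest ctx')
    _                                          -> Nothing
parseLet _ = Nothing
```

For a one-line `let x = 42 in x` the lexer never sends VCCurly, so `parseLet` reaches `close` with `TIn` as the next token and takes the second alternative. The token `in` is then consumed by the enclosing rule, which is why no close-brace token has to travel back from the lexer.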

Or at least, that's the theory. It turns out that this sometimes breaks because of other features of Haskell/GHC that don't fit as well into an LALR(1) grammar.

Of your two examples above, the first was changed in Haskell 2010 (because people realized it was too awkward to parse), so GHC is correct there. But the second (e = case 1 of 1 -> 1 :: Int + 1) happens because of a different design decision GHC makes:

Making a parser parse precisely the right language is hard. So GHC's parser adheres to the following principle:

  • We often parse "over-generously", and filter out the bad cases later.

GHC has sufficient extensions that Int + 1 could parse as a type with enough of them enabled. Also, having to write a LALR(1)-parser to directly handle every combination of enabled/disabled extensions would be really awkward (not sure it's even possible). So it just parses the most general language first, and fails later when it checks if the needed extensions for the result are enabled. But by that time parsing is finished and it's too late to trigger the parse-error rule. (Or so I'm assuming.)
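The "parse generously, filter later" idea can be sketched as a separate validation pass over an already-built syntax tree. Everything here is a simplified assumption: the `Ty` type, `requiredExts` and `validate` are hypothetical, and the mapping from syntax to the extension names TypeOperators and DataKinds is only meant to illustrate the shape of the check, not to reproduce GHC's actual logic.

```haskell
-- A tiny type syntax tree, as produced by an over-generous parser.
data Ty
  = TyCon String        -- e.g. Int
  | TyLit Integer       -- e.g. 1 used as a type
  | TyOp Ty String Ty   -- e.g. Int + 1
  deriving (Eq, Show)

data Ext = TypeOperators | DataKinds
  deriving (Eq, Show)

-- Which extensions a tree needs (assumed mapping, for illustration).
requiredExts :: Ty -> [Ext]
requiredExts (TyCon _)    = []
requiredExts (TyLit _)    = [DataKinds]
requiredExts (TyOp l _ r) = TypeOperators : requiredExts l ++ requiredExts r

-- Post-parse filter: reject trees that use a disabled extension.
validate :: [Ext] -> Ty -> Either String Ty
validate enabled ty =
  case filter (`notElem` enabled) (requiredExts ty) of
    []      -> Right ty
    (e : _) -> Left ("extension not enabled: " ++ show e)
```

The point to notice is that `validate` only runs on a complete tree. By the time it rejects `Int + 1`, the parser has long since accepted the `+` token, so there is no error at that token to trigger the layout rule.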

Finally, I should say that I don't think there's anything impossible about handling the parse-error(t) rule even if you're not using an (LA)LR(1) parser. I suspect something like GHC's close nonterminal could work well in a combinator parser too. But you still need some kind of communication back to the lexer.

Ørjan Johansen answered Oct 03 '22 02:10