Is there any known way to implement good error handling for machine-generated parsers? Does a "pattern" or a known algorithm exist for this kind of problem?
By "good" I mean something resembling the results obtainable with hand-crafted recursive descent parsers and modern compilers: the parser does not stop at the first error, and it can be made to emit "meaningful" errors rather than one "unrecognized token in line xyz" at a time.
Ideally this approach should be automated as well, not handcrafted.
I am not searching for a library; I need an approach that can be used on different platforms and is ideally as language-independent as possible.
A parser generator takes a grammar as input and automatically generates source code that can parse streams of characters using the grammar. The generated code is a parser, which takes a sequence of characters and tries to match the sequence against the grammar.
I have a rather different perspective on this problem, which is that you shouldn't treat syntax errors as internal compiler errors. Any practical compiler is actually implementing several languages; the two that matter here are L, the language as officially specified, and M, a superset of L made up of the erroneous-but-recognized programs for which the compiler produces meaningful diagnostics.
You can use automatic parser generator tools, as you are looking for, if you specify the language M in your parser instead of the language L. The trouble with this approach is that language designers always specify L and not M. I can't think of a single case where there's anything like a standard for M.
This isn't just abstract nonsense. There's a recent change to C++ that illustrates this distinction pretty well. It used to be that
template< class T > class X;
template< class T > class Y;
X<Y<int>> foo; // syntax in M
had an error on the third line, because the characters ">>" were lexed as the token for the right-shift operator. That line had to be written
X<Y<int> > foo; // syntax in L
The standard was changed not to require the extra space. The reason was that all major compilers had already written code to recognize this case in order to generate a meaningful error message. In other words, they found out that the M language was already implemented everywhere. Once the committee determined that, they transferred the M-syntax into the new version of L.
We would have better language design overall if designers considered the M language at the same time as they're working on the L language. Simply for their own sanity, they'd make some effort to minimize the size of the specification for M, which would be a good thing for everyone. Alas, the world is not there yet.
The upshot is that you need to design your own language M. That's the hard problem. Whether you use an automated tool for it is somewhat beside this point. It helps, but it doesn't get rid of the most time-consuming part.
With a traditional yacc/bison generator you get the yyerror/YYERROR framework, with which it is not easy to generate very useful error messages, due to the nature of bottom-up LALR parsing. You can also add error recovery rules there, though you might need them mainly to suppress spurious error messages from failed rules, where you only wanted to cut parsing off.
With a PEG-based parser you get the much better ~{} postfix error-action block syntax to work with. See e.g. the peg manual.
rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
| ...
rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }
| ...
You get excellent error messages at the actual place of the error. But you have to write PEG rules, which are not so easy to write, especially when handling operator precedence; that part is easier with a LALR parser.
With a simpler recursive descent parser generator you get the same error-reporting advantages as with PEG, but with a much slower parse speed.
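To make the recursive-descent/PEG style of reporting concrete, here is a minimal sketch in Python over a toy grammar of my own invention (a number, followed by any mix of '+'/'*' and more numbers). The point is that the failure is reported at the exact token position, naming what was expected:

```python
import re

def tokenize(src):
    """Split src into (text, column) pairs; '<eof>' marks the end."""
    toks = [(m.group(), m.start()) for m in re.finditer(r'\d+|[+*]|\S', src)]
    toks.append(('<eof>', len(src)))
    return toks

def parse_expr(src):
    """expr = NUMBER (('+'|'*') NUMBER)* ; returns a list of error messages."""
    toks = tokenize(src)
    i = 0
    errors = []

    def number():
        nonlocal i
        tok, col = toks[i]
        if tok.isdigit():
            i += 1
        else:
            errors.append(f"col {col}: expected a number, found {tok!r}")
            if tok not in ('+', '*', '<eof>'):
                i += 1  # skip the junk token so this error doesn't cascade

    number()
    while toks[i][0] in ('+', '*'):
        i += 1       # consume the operator
        number()
    if toks[i][0] != '<eof>':
        errors.append(f"col {toks[i][1]}: expected '+', '*' or end of input, "
                      f"found {toks[i][0]!r}")
    return errors
```

Because the parser knows exactly which alternative it was committed to when the match failed, the message names the expected category rather than a bare "unrecognized token"; e.g. `parse_expr("1 + x")` yields `["col 4: expected a number, found 'x'"]`.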
See the same discussion at http://lambda-the-ultimate.org/node/4781
People have been trying to figure out how to report and repair syntax errors since the first one occurred. There are many technical papers on how to do this; hunting for the string "syntax error repair" at scholar.google.com produces 57 hits.
There are really several problems:
1) How to report a meaningful error to the reader. To start with, there is a difference between where the parser detects the error and where the user actually made it. For instance, a C program might have a '++' operator in a strange place:
void p() {
x = y ++
z = 0;
<EOF>
Most parsers will choke when "z" gets encountered, and report it as the place of the error. Yet if the mistake is using '++' when '+' was intended, this report is wrong. Unfortunately, getting this right requires you to be able to read the programmer's mind.
You also have the problem of reporting the error context. Do you report the error as being in an expression [on first glance, it seems so]? In a statement? In a line? In a function body? In a function declaration? Probably you want to report the narrowest syntactic category that can surround the point of error. (Note that you can't report the function body or declaration as "surrounding" the point of error, because they, too, are not complete!) What if the error was really a missing semicolon after the '++'? Then the error location wasn't really "in the expression". What if the repair requires the insertion of a missing string quote? A macro continuation character?
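One way to get the "narrowest surrounding category" into messages is to have the parser push a context label whenever it enters a syntactic category and pop it on exit; an error then reports the innermost label. A small sketch, with hypothetical names (real parsers thread this through their state rather than a global object):

```python
import contextlib

class ErrorContext:
    """Tracks the stack of syntactic categories the parser is inside."""
    def __init__(self):
        self.stack = []    # innermost category is last
        self.errors = []

    @contextlib.contextmanager
    def inside(self, category):
        self.stack.append(category)
        try:
            yield
        finally:
            self.stack.pop()

    def report(self, message):
        where = self.stack[-1] if self.stack else "top level"
        self.errors.append(f"in {where}: {message}")

# Simulated parse of the '++' example above: the parser is three
# categories deep when the error is detected.
ctx = ErrorContext()
with ctx.inside("function body"):
    with ctx.inside("statement"):
        with ctx.inside("expression"):
            ctx.report("unexpected '++'")
print(ctx.errors[0])  # prints: in expression: unexpected '++'
```

Note that this mechanism only reports where the parser *was*; it does not solve the harder problem above of deciding that the true mistake was a missing semicolon one token earlier.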
So you have to somehow decide what constitutes the actual error, and that gets us to error repair.
2) Error repair: for the tool to proceed in a meaningful way, it has to repair the error. Presumably this means patching the stream of input tokens to produce a legal program (which you may not be able to do if the source has multiple errors). What if there are several possible patches? It should be obvious that the best error report is "yyyy is wrong, I suspect you should have used xxxx". How big a patch should one consider for a repair: just the token which triggered the error? The tokens which follow it? How about tokens that precede it?
I note it is hard to do automatic, general error repair proposal on hand-written parsers, because the grammar, needed to guide such repair, isn't explicitly available anywhere. So you would expect automated repair to work best on tools for which the grammar was an explicit artifact.
It may also be that the error repair should take into account common mistakes. If people tend to leave ';' off statements, and inserting one fixes the file, it might be a good repair. If they rarely do that, and there is more than one possible repair (e.g., replace "++" by "+"), then an alternative repair is probably a better suggestion.
3) Semantic impact of the repair. Even if you fix the syntax errors, the repaired program may not be sensible. If your error requires the insertion of an identifier, what identifier should be used?
FWIW, our DMS Software Reengineering Toolkit does automated repair driven completely by the grammar. It operates on the assumption that the token at the error point should be deleted, or that some other single token should be inserted to its left. This catches missing ";" and extra plus signs, and it often succeeds in producing a legal repair. Often it isn't the "right" one, but at least it lets the parser proceed to the rest of the source code.
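The single-token repair strategy described above (delete the token at the error point, or insert one token to its left) can be sketched generically: given any predicate that says whether a token sequence is legal, try each one-token patch and keep the first that works. The toy grammar and function names below are illustrative, not DMS's actual machinery:

```python
def parses(tokens):
    """True iff tokens match the toy grammar: expr = NUM (('+'|'*') NUM)*."""
    if not tokens or tokens[0] != 'NUM':
        return False
    i = 1
    while i < len(tokens):
        if tokens[i] in ('+', '*') and i + 1 < len(tokens) and tokens[i + 1] == 'NUM':
            i += 2
        else:
            return False
    return True

def repair(tokens, alphabet=('NUM', '+', '*')):
    """Try every single-token deletion or insertion; return (patched, note)."""
    if parses(tokens):
        return tokens, "no repair needed"
    for i in range(len(tokens) + 1):
        if i < len(tokens):
            patched = tokens[:i] + tokens[i + 1:]      # try deleting token i
            if parses(patched):
                return patched, f"deleted {tokens[i]!r} at position {i}"
        for sym in alphabet:                           # try inserting before i
            # a real tool would rank candidates by common-mistake frequency
            patched = tokens[:i] + [sym] + tokens[i:]
            if parses(patched):
                return patched, f"inserted {sym!r} before position {i}"
    return None, "no single-token repair found"
```

This also illustrates why the repair often isn't the "right" one: `repair(['NUM', 'NUM'])` deletes the first number, where inserting '+' between the two would more likely match the user's intent; weighting candidate patches by common mistakes, as suggested above, addresses exactly that.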
I think the hunt for good, automated error repair will continue for a long time.
FWIW, the paper Syntax Error Repair for a Java-based Parser Generator reports that Burke's Ph.D. thesis:
M.G. Burke, 1983, A practical method for LR and LL syntactic error diagnosis and recovery, PhD thesis, Department of Computer Science, New York University
is pretty good. In particular, it repairs errors by considering and revising the left context of the error as well as the error scope. It looks like one can get it from the ACM.
This is probably not what you want to hear, but you're better off hand-writing the parser and lexer.
It's not a particularly hard task (especially when compared with writing the semantic analyzer and code generator), and it will produce the best results when it comes to error handling.
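For a taste of what hand-writing buys you, here is a sketch of the classic panic-mode technique: on an error, emit a message, then skip ahead to a synchronizing token (';' here) so that independent errors later in the file still get reported in the same run. The statement grammar is invented for illustration:

```python
import re

def parse_program(src):
    """Toy grammar: program = (IDENT '=' NUMBER ';')* ; returns error list."""
    toks = re.findall(r'\d+|\w+|[=;]|\S', src) + ['<eof>']
    i, errors = 0, []

    def expect(pred, what):
        nonlocal i
        if pred(toks[i]):
            i += 1
            return True
        errors.append(f"expected {what}, found {toks[i]!r}")
        return False

    while toks[i] != '<eof>':
        ok = (expect(str.isidentifier, "an identifier")
              and expect(lambda t: t == '=', "'='")
              and expect(str.isdigit, "a number")
              and expect(lambda t: t == ';', "';'"))
        if not ok:
            # panic mode: skip to just past the next ';' and resume parsing
            while toks[i] not in (';', '<eof>'):
                i += 1
            if toks[i] == ';':
                i += 1
    return errors
```

Because each parsing function knows exactly what it was looking for, the messages are specific, and the synchronization point is a deliberate design choice rather than whatever the generated automaton happens to do; e.g. `parse_program("x = ; y 2;")` reports two separate errors, one per broken statement.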
But don't trust me, trust Walter Bright, the author of the first native C++ compiler and inventor of the D programming language.
He has an article on exactly this topic at Dr. Dobb's here (error recovery is on page 2).