Is there any known way to implement good error handling for machine-generated parsers? Does a "pattern" or a known algorithm exist for this kind of problem?
By "good" I mean something resembling the results obtainable with hand-crafted recursive descent parsers and modern compilers: the parser does not stop at the first error, and it can be made to emit "meaningful" errors rather than one "unrecognized token in line xyz" at a time.
Ideally this approach should be automated as well, not handcrafted.
I am not searching for a library; I need an approach that can be used on different platforms and is ideally as language-independent as possible.
A parser generator takes a grammar as input and automatically generates source code that can parse streams of characters using the grammar. The generated code is a parser, which takes a sequence of characters and tries to match the sequence against the grammar.
I have a rather different perspective on this problem, which is that you shouldn't treat syntax errors as internal compiler errors. Any practical compiler is actually implementing several languages; the two that matter here are L, the language as officially specified, and M, a superset of L made up of the erroneous-but-recognized programs for which the compiler produces meaningful diagnostics.
You can use automatic parser generator tools, as you are looking for, if you specify the language M in your parser instead of the language L. The trouble with this approach is that language designers always specify L and not M. I can't think of a single case where there's anything like a standard for M.
This isn't just abstract nonsense. There's a recent change to C++ that illustrates this distinction pretty well. It used to be that
template< class T > class X;
template< class T > class Y;
X<Y<int>> foo; // syntax in M
had an error on the third line, because the characters ">>" were lexed as the token for the right-shift operator. That line had to be written
X<Y<int> > foo; // syntax in L
The standard was changed not to require the extra space. The reason was that all major compilers had already written code to recognize this case in order to generate a meaningful error message. In other words, they found out that the M language was already implemented everywhere. Once the committee determined that, they transferred the M-syntax into the new version of L.
We would have better language design overall if designers considered the M language at the same time as they're working on the L language. Simply for their own sanity, they'd make some effort to minimize the size of the specification for M, which would be a good thing for everyone. Alas, the world is not there yet.
The upshot is that you need to design your own language M. That's the hard problem. Whether you use an automated tool for it is somewhat beside this point. It helps, but it doesn't get rid of the most time-consuming part.
With a traditional yacc/bison generator you get the yyerror/YYERROR framework, with which it is not easy to generate very useful error messages, due to the nature of bottom-up LALR parsing. You can also add error recovery rules there, though you might need them mainly to suppress spurious error messages from failed rules, where you only wanted to cut parsing off.
With a PEG-based parser you get the much better ~{} postfix error-action block syntax to work with. See e.g. the peg manual.
rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
| ...
rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }
| ...
You get excellent error messages at the actual place of the error. But you have to write PEG rules, which are not so easy to write, especially when handling operator precedence; that part is easier with a LALR parser.
With a simpler recursive descent parser generator you get the same error-reporting advantages as with PEG, but with a much slower parse speed.
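To make the recursive-descent/PEG style of reporting concrete, here is a minimal sketch in Python over a toy grammar of my own invention (a number, followed by any mix of '+'/'*' and more numbers). The point is that the failure is reported at the exact token position, naming what was expected:

```python
import re

def tokenize(src):
    """Split src into (text, column) pairs; '<eof>' marks the end."""
    toks = [(m.group(), m.start()) for m in re.finditer(r'\d+|[+*]|\S', src)]
    toks.append(('<eof>', len(src)))
    return toks

def parse_expr(src):
    """expr = NUMBER (('+'|'*') NUMBER)* ; returns a list of error messages."""
    toks = tokenize(src)
    i = 0
    errors = []

    def number():
        nonlocal i
        tok, col = toks[i]
        if tok.isdigit():
            i += 1
        else:
            errors.append(f"col {col}: expected a number, found {tok!r}")
            if tok not in ('+', '*', '<eof>'):
                i += 1  # skip the junk token so this error doesn't cascade

    number()
    while toks[i][0] in ('+', '*'):
        i += 1       # consume the operator
        number()
    if toks[i][0] != '<eof>':
        errors.append(f"col {toks[i][1]}: expected '+', '*' or end of input, "
                      f"found {toks[i][0]!r}")
    return errors
```

Because the parser knows exactly which alternative it was committed to when the match failed, the message names the expected category rather than a bare "unrecognized token"; e.g. `parse_expr("1 + x")` yields `["col 4: expected a number, found 'x'"]`.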
See the same discussion at http://lambda-the-ultimate.org/node/4781
People have been trying to figure out how to report and repair syntax errors since the first one occurred. There are many technical papers on how to do this; hunting for the string "syntax error repair" at scholar.google.com produces 57 hits.
There are really several problems:
1) How to report a meaningful error to the reader. To start with, there is a difference between where the parser detects the error and where the user actually made it. For instance, a C program might have a '++' operator in a strange place:
void p() {
x = y ++
z = 0;
<EOF>
Most parsers will choke when "z" gets encountered, and report it as the place of the error. Yet if the mistake is using '++' when '+' was intended, this report is wrong. Unfortunately, getting this right requires you to be able to read the programmer's mind.
You also have the problem of reporting the error context. Do you report the error as being in an expression [on first glance, it seems so]? In a statement? In a line? In a function body? In a function declaration? Probably you want to report the narrowest syntactic category that can surround the point of error. (Note that you can't report the function body or declaration as "surrounding" the point of error, because they, too, are not complete!) What if the error was really a missing semicolon after the '++'? Then the error location wasn't really "in the expression". What if the repair requires the insertion of a missing string quote? A macro continuation character?
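One way to get the "narrowest surrounding category" into messages is to have the parser push a context label whenever it enters a syntactic category and pop it on exit; an error then reports the innermost label. A small sketch, with hypothetical names (real parsers thread this through their state rather than a global object):

```python
import contextlib

class ErrorContext:
    """Tracks the stack of syntactic categories the parser is inside."""
    def __init__(self):
        self.stack = []    # innermost category is last
        self.errors = []

    @contextlib.contextmanager
    def inside(self, category):
        self.stack.append(category)
        try:
            yield
        finally:
            self.stack.pop()

    def report(self, message):
        where = self.stack[-1] if self.stack else "top level"
        self.errors.append(f"in {where}: {message}")

# Simulated parse of the '++' example above: the parser is three
# categories deep when the error is detected.
ctx = ErrorContext()
with ctx.inside("function body"):
    with ctx.inside("statement"):
        with ctx.inside("expression"):
            ctx.report("unexpected '++'")
print(ctx.errors[0])  # prints: in expression: unexpected '++'
```

Note that this mechanism only reports where the parser *was*; it does not solve the harder problem above of deciding that the true mistake was a missing semicolon one token earlier.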
So you have to somehow decide what constitutes the actual error, and that gets us to error repair.
2) Error repair: for the tool to proceed in a meaningful way, it has to repair the error. Presumably this means patching the stream of input tokens to produce a legal program (which you may not be able to do if the source has multiple errors). What if there are several possible patches? It should be obvious that the best error report is "yyyy is wrong, I suspect you should have used xxxx". How big a patch should one consider for a repair: just the token which triggered the error? The tokens which follow it? How about tokens that precede it?
I note it is hard to do automatic, general error repair proposal on hand-written parsers, because the grammar, needed to guide such repair, isn't explicitly available anywhere. So you would expect automated repair to work best on tools for which the grammar was an explicit artifact.
It may also be that the error repair should take into account common mistakes. If people tend to leave ';' off statements, and inserting one fixes the file, it might be a good repair. If they rarely do that, and there is more than one possible repair (e.g., replace "++" by "+"), then an alternative repair is probably a better suggestion.
3) Semantic impact of the repair. Even if you fix the syntax errors, the repaired program may not be sensible. If your error requires the insertion of an identifier, what identifier should be used?
FWIW, our DMS Software Reengineering Toolkit does automated repair driven completely by the grammar. It operates on the assumption that the token at the error point should be deleted, or that some other single token should be inserted to its left. This catches missing ";" and extra plus signs, and it often succeeds in producing a legal repair. Often it isn't the "right" one, but at least it lets the parser proceed to the rest of the source code.
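The single-token repair strategy described above (delete the token at the error point, or insert one token to its left) can be sketched generically: given any predicate that says whether a token sequence is legal, try each one-token patch and keep the first that works. The toy grammar and function names below are illustrative, not DMS's actual machinery:

```python
def parses(tokens):
    """True iff tokens match the toy grammar: expr = NUM (('+'|'*') NUM)*."""
    if not tokens or tokens[0] != 'NUM':
        return False
    i = 1
    while i < len(tokens):
        if tokens[i] in ('+', '*') and i + 1 < len(tokens) and tokens[i + 1] == 'NUM':
            i += 2
        else:
            return False
    return True

def repair(tokens, alphabet=('NUM', '+', '*')):
    """Try every single-token deletion or insertion; return (patched, note)."""
    if parses(tokens):
        return tokens, "no repair needed"
    for i in range(len(tokens) + 1):
        if i < len(tokens):
            patched = tokens[:i] + tokens[i + 1:]      # try deleting token i
            if parses(patched):
                return patched, f"deleted {tokens[i]!r} at position {i}"
        for sym in alphabet:                           # try inserting before i
            # a real tool would rank candidates by common-mistake frequency
            patched = tokens[:i] + [sym] + tokens[i:]
            if parses(patched):
                return patched, f"inserted {sym!r} before position {i}"
    return None, "no single-token repair found"
```

This also illustrates why the repair often isn't the "right" one: `repair(['NUM', 'NUM'])` deletes the first number, where inserting '+' between the two would more likely match the user's intent; weighting candidate patches by common mistakes, as suggested above, addresses exactly that.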
I think the hunt for good, automated error repair will continue for a long time.
FWIW, the paper Syntax Error Repair for a Java-based Parser Generator reports that Burke's Ph.D. thesis:
M.G. Burke, 1983, A practical method for LR and LL syntactic error diagnosis and recovery, PhD thesis, Department of Computer Science, New York University
is pretty good. In particular, it repairs errors by considering and revising the left context of the error as well as the error scope. It looks like one can get it from the ACM.
This is probably not what you want to hear, but you're better off hand-writing the parser and lexer.
It's not a particularly hard task (especially when compared with writing the semantic analyzer and code generator), and it will produce the best results when it comes to error handling.
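For a taste of what hand-writing buys you, here is a sketch of the classic panic-mode technique: on an error, emit a message, then skip ahead to a synchronizing token (';' here) so that independent errors later in the file still get reported in the same run. The statement grammar is invented for illustration:

```python
import re

def parse_program(src):
    """Toy grammar: program = (IDENT '=' NUMBER ';')* ; returns error list."""
    toks = re.findall(r'\d+|\w+|[=;]|\S', src) + ['<eof>']
    i, errors = 0, []

    def expect(pred, what):
        nonlocal i
        if pred(toks[i]):
            i += 1
            return True
        errors.append(f"expected {what}, found {toks[i]!r}")
        return False

    while toks[i] != '<eof>':
        ok = (expect(str.isidentifier, "an identifier")
              and expect(lambda t: t == '=', "'='")
              and expect(str.isdigit, "a number")
              and expect(lambda t: t == ';', "';'"))
        if not ok:
            # panic mode: skip to just past the next ';' and resume parsing
            while toks[i] not in (';', '<eof>'):
                i += 1
            if toks[i] == ';':
                i += 1
    return errors
```

Because each parsing function knows exactly what it was looking for, the messages are specific, and the synchronization point is a deliberate design choice rather than whatever the generated automaton happens to do; e.g. `parse_program("x = ; y 2;")` reports two separate errors, one per broken statement.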
But don't trust me, trust Walter Bright, the author of the first native C++ compiler and inventor of the D programming language.
He has an article on exactly this topic at Dr. Dobb's here (error recovery is on page 2).