
How to create an AST parser that tolerates syntax errors?

First, what should I read about parsing and building ASTs?

How do I create a parser for a language (like SQL) that builds an AST and tolerates syntax errors?

For example, for "3+4*5":

  +
 / \
3   *
   / \
  4   5

And for "3+4*+", which contains a syntax error, the parser would guess that the user meant:

  +
 / \
3   *
   / \
  4   +
     / \
    ?   ?

Where to start?

SQL:

    SELECT_________________
   /           \           \
  .           FROM        JOIN
 / \           |         /    \
a city_name  people   address  ON
                                |
                                =______________
                               /               \
                              .____             .
                             /     \           / \
                            p  address_id     a  id
asked Sep 26 '14 12:09 by Medvedev



2 Answers

The standard answer to the question of how to build parsers (that build ASTs) is to read the standard texts on compiling. Aho, Sethi, and Ullman's "Dragon Book" (Compilers: Principles, Techniques, and Tools) is the classic. If you don't have the patience for the best reference materials, you're going to have more trouble, because they provide the theory and investigate the subtleties. But here is my answer for people in a hurry, building recursive descent parsers.

One can build parsers with built-in error recovery. There are many papers on this sort of thing; it was a hot topic in the 1980s. Check out Google Scholar and hunt for "syntax error repair". The basic idea is that the parser, on encountering a parsing error, either skips to some well-known beacon (";" as a statement delimiter is pretty popular for C-like languages, which is why you got asked in a comment if your language has statement terminators), or proposes various input stream deletions or insertions to climb over the point of the syntax error. The sheer variety of such schemes is surprising. The key idea is generally to take into account as much information around the point of error as possible. One of the most intriguing ideas I ever saw had two parsers, one running N tokens ahead of the other, looking for syntax-error land mines, and the second parser being fed error repairs based on the N tokens available before it encounters the syntax error. This lets the second parser choose to act differently before arriving at the syntax error. If you don't have this, most parsers throw away left context and thus lose the ability to repair. (I never implemented such a scheme.)
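The "skip to a beacon" idea can be sketched as follows. This is a minimal illustration, not a production scheme: the toy statement form `name = number ;`, the function names `tokenize` and `parse_statements`, and the choice of ";" as the only beacon are all assumptions made for the example.

```python
import re

def tokenize(text):
    # Numbers, identifiers, and single punctuation characters; whitespace is skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[^\s\w]", text)

def parse_statements(tokens):
    """Parse a sequence of `name = number ;` statements (a toy grammar).

    On a syntax error, record it and skip ahead to the next ';' so that
    later statements can still be parsed.
    """
    statements, errors = [], []
    pos = 0
    while pos < len(tokens):
        stmt = tokens[pos:pos + 4]
        ok = (len(stmt) == 4 and stmt[0].isidentifier() and stmt[1] == "="
              and stmt[2].isdecimal() and stmt[3] == ";")
        if ok:
            statements.append((stmt[0], int(stmt[2])))
            pos += 4
            continue
        errors.append(f"syntax error in statement starting at token {pos}")
        # Panic mode: discard tokens up to and including the beacon ';'.
        while pos < len(tokens) and tokens[pos] != ";":
            pos += 1
        pos += 1
    return statements, errors

statements, errors = parse_statements(tokenize("a = 1 ; b = = 2 ; c = 3 ;"))
print(statements)  # [('a', 1), ('c', 3)]
print(errors)      # ['syntax error in statement starting at token 4']
```

The broken middle statement is reported once and dropped, and parsing resumes cleanly at `c = 3 ;`.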

The choice of things to insert can often be derived from the information used to build the parser in the first place (often First and Follow sets). This is relatively easy to do with L(AL)R parsers, because the parse tables contain the necessary information and are available to the parser at the point where it encounters an error. If you want to understand how to do this, you need to understand the theory (oops, there's that compiler book again) of how the parsers are constructed. (I have implemented this scheme successfully several times.)
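As a rough sketch of deriving insertions from the parse table: the example below uses a small hand-written LL(1) table rather than the L(AL)R tables described above, only because it is shorter to show. The grammar, the `PREFERENCE` ordering for choosing among expected tokens, and the rule of deleting tokens that nothing expects are all assumptions for illustration.

```python
# LL(1) table for:  E -> T E'    E' -> + T E' | (empty)    T -> num | ( E )
TABLE = {
    ("E", "num"): ["T", "E'"],
    ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],
    ("E'", "$"): [],
    ("T", "num"): ["num"],
    ("T", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T"}
TERMINALS = {"num", "+", "(", ")", "$"}
PREFERENCE = ["num", "+", ")"]  # assumed ranking: which insertion to try first

def expected(symbol):
    """Tokens the table accepts when `symbol` is on top of the stack."""
    if symbol in NONTERMINALS:
        return {tok for (nt, tok) in TABLE if nt == symbol}
    return {symbol}

def rank(tok):
    return PREFERENCE.index(tok) if tok in PREFERENCE else len(PREFERENCE)

def repair(tokens):
    """Return (repaired_tokens, repairs) for a list of token kinds."""
    tokens = list(tokens) + ["$"]
    stack, repairs, pos = ["$", "E"], [], 0
    while stack:
        top, look = stack[-1], tokens[pos]
        if top in NONTERMINALS and (top, look) in TABLE:
            stack.pop()
            stack.extend(reversed(TABLE[(top, look)]))
        elif top == look:
            stack.pop()
            pos += 1
        elif top == "$" or look not in TERMINALS:
            # Nothing can use this token: delete it.
            repairs.append(("delete", look, pos))
            del tokens[pos]
        else:
            # The table tells us what was expected here: insert one of those.
            choice = min(expected(top), key=rank)
            repairs.append(("insert", choice, pos))
            tokens.insert(pos, choice)
    return tokens[:-1], repairs

print(repair(["num", "+", "+", "num"]))
# (['num', '+', 'num', '+', 'num'], [('insert', 'num', 2)])
```

The preference order matters: preferring "(" over "num" would make the repair insert open parentheses forever, which is one reason real schemes attach costs to candidate insertions.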

Regardless of what you do, syntax error repair doesn't help much, because it is almost impossible to guess what the writer of the parsed document actually intended. This suggests fancy schemes won't be really helpful. I stick to simple ones; people are happy to get an error report and some semi-graceful continuation of parsing.

A real problem with rolling your own parser for a real language is that real languages are nasty, messy things; people building real implementations get it wrong and frozen in stone because of existing code bases, or insist on bending/improving the language (standards are for wimps, goodies are for marketing) because it's cool. Expect to spend a lot of time re-calibrating what you think the grammar is against the ground truth of real code. As a general rule, if you want a working parser, it is better to get one that has a track record than to roll it yourself.

A lesson most people who are hell-bent on building a parser don't get is that if they want to do anything useful with the parse result or tree, they'll need a lot more basic machinery than just the parser. Check my bio for "Life After Parsing".

answered Nov 13 '22 12:11 by Ira Baxter


There are two things the parser could do:

  1. Report the error and have the user try again.
  2. Repair the error and proceed.

Generally speaking the first one is easier (and safer). There may not always be enough information for the parser to infer the intent when the syntax is wrong. Depending on the circumstances, it may be dangerous to proceed with a repair that makes the input syntactically correct but semantically wrong.

I've written a few hand-rolled recursive descent parsers for little languages. When writing code to interpret the grammar rules explicitly (as opposed to using a parser generator), it's easy to detect errors, because the next token doesn't fit the production rule. Generated parsers tend to spit out a simplistic "expected $(TOKEN_TYPE) here" message, which isn't always useful to the user. With a hand-written parser, it's often easy to give a more specific diagnostic message, but it can be time-consuming to cover every case.

If your goal is to report the problem but keep parsing (so that you can see if there are additional problems), you can put a special AST node in the tree at the point of the error. This keeps the tree from falling apart.
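A minimal sketch of the error-node idea, assuming a toy grammar of integers with "+" and "*". The class names `Num`, `BinOp`, `Error`, and `Parser` are made up for the example. Where a number is expected but missing, `factor` returns an `Error` node without consuming anything, so the enclosing rules carry on.

```python
import re
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Error:
    message: str  # placeholder node marking where the input was malformed

class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):  # expr -> term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.pos += 1
            node = BinOp("+", node, self.term())
        return node

    def term(self):  # term -> factor ('*' factor)*
        node = self.factor()
        while self.peek() == "*":
            self.pos += 1
            node = BinOp("*", node, self.factor())
        return node

    def factor(self):  # factor -> number
        tok = self.peek()
        if tok is not None and tok.isdecimal():
            self.pos += 1
            return Num(int(tok))
        # Do not consume the offending token; let the caller decide what to do.
        return Error(f"expected a number, found {tok!r}")

print(Parser("3+4*5").expr())
print(Parser("3+4*+").expr())
```

For "3+4*+" this particular grammar yields `(3 + (4 * ?)) + ?`, with one `Error` node after "*" and one at end of input. That is a different guess from the tree in the question, which shows that the shape of the repaired tree depends on where the parser decides to place the placeholder.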

You then have to resync to some point beyond the error in order to continue parsing. As Ira Baxter mentioned in his answer, you might look for a token, like ';', that separates statements. The correct token(s) to look for depends on the language you're parsing. Another possibility is to guess what the user meant (e.g., infer an extra token or a different token at the point the error was detected) and then continue. If you encounter another syntax error within the next few tokens, you could backtrack, make a different guess, and try again.
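The guess-and-backtrack approach might look like the sketch below. It assumes a deliberately tiny grammar, `num (+ num)*`, over token kinds, and the names `first_error`, `guesses`, and `repair` are hypothetical. A guess is kept only if parsing then runs cleanly for at least `window` tokens (or to the end); otherwise it is discarded and the next guess is tried. If no guess qualifies, the offending token is deleted.

```python
def first_error(tokens):
    """Index of the first token that breaks `num (+ num)*`, or None if valid."""
    for i, tok in enumerate(tokens):
        if tok != ("num" if i % 2 == 0 else "+"):
            return i
    # A valid input ends with a num, so its length must be odd.
    return None if len(tokens) % 2 == 1 else len(tokens)

def guesses(tokens, i):
    """Candidate repairs at position i, in the order they are tried."""
    yield ("insert", "num"), tokens[:i] + ["num"] + tokens[i:]
    yield ("insert", "+"), tokens[:i] + ["+"] + tokens[i:]
    if i < len(tokens):
        yield ("delete", tokens[i]), tokens[:i] + tokens[i + 1:]

def repair(tokens, window=2):
    """Repair until the input parses; window should be at least 2."""
    tokens, log = list(tokens), []
    while (err := first_error(tokens)) is not None:
        for (kind, tok), guess in guesses(tokens, err):
            nxt = first_error(guess)
            if nxt is None or nxt - err >= window:
                break  # this guess survives the next few tokens: keep it
            # Another error follows too soon: backtrack and try the next guess.
        else:
            # No guess worked; drop the offending token to guarantee progress.
            kind, tok = "delete", tokens[err]
            guess = tokens[:err] + tokens[err + 1:]
        tokens = guess
        log.append((kind, tok, err))
    return tokens, log

print(repair(["num", "+", "*", "num"]))
# (['num', '+', 'num'], [('delete', '*', 2)])
```

In the example, inserting "num" or "+" before the stray "*" runs into another error straight away, so both guesses are abandoned and deletion wins.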

answered Nov 13 '22 11:11 by Adrian McCarthy