Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are parsing expression grammars suited to parsing the shell command language?

The POSIX shell command language is not easy to parse, largely because of tight coupling between lexing and parsing.

However, parsing expression grammars (PEGs) are often scannerless. By combining lexing and parsing, it seems that I could avoid these problems. The language that I am using (Rust) has a well-maintained PEG library. However, I know of three difficulties that could make it impractical to use this library:

  • Shells must be able to parse line by line, not reading characters past the end of the line.
  • Aliases are purely lexical, and can cause a token to be replaced by any sequence of other tokens in certain situations
  • Shell reserved words are only recognized in certain situations

Is a PEG suited to parsing the shell command language given these requirements, or is a hand-written recursive-descent parser more suitable?

like image 833
Demi Avatar asked Mar 09 '15 02:03

Demi


1 Answers

Yes, a PEG can be used, and none of the issues you note should be a problem. In particular:

1) parsing line by line: most PEG tools will not have any built-in white-space skipping. All white space including newlines must be explicitly handled by you, which means you can handle newline any way you like.

2) You should not use the parse tree from PEG as your AST. Instead you should descend the parse tree and build an AST. For aliases then, after the parse has completed and you're building your AST, you can detect the alias and insert the appropriate expansion for the alias instead.

3) Reserved words are not reserved unless you reserve them. That is, if you have a context where either a reserved word or another alphanumeric symbol can occur, you must first check for the reserved words explicitly, then the arbitrary alphanumeric symbol, because once the PEG decides it has a match, that will not back-track. Anywhere a reserved word is not permitted, simply don't check for it, and your generalised alphanumeric symbol rule will succeed instead.

like image 86
cliffordheath Avatar answered Sep 28 '22 00:09

cliffordheath