
Grammar rules for comments

I am working with reflect.js (a nice JavaScript parser) from Zach Carter on GitHub; I am trying to modify the behavior of his parser to handle comments as normal tokens that should be parsed like anything else. The default behavior of reflect.js is to keep track of all comments (the lexer grabs them as tokens) and then append a list of them to the end of the AST (Abstract Syntax Tree) it creates.

However, I would like these comments to be included in-place in the AST. I believe this change will involve adding grammar rules to the project's grammar.y file. There are currently no rules for comments; if my understanding is correct, that is why they are ignored by the main parsing code.

How do you write rules to include comments in an AST?

BlackVegetable asked Oct 21 '22 05:10


1 Answer

The naive version modifies each rule of the original grammar:

      LHS = RHS1 RHS2 ... RHSN ;

to be:

      LHS =  RHS1 COMMENTS RHS2 COMMENTS ... COMMENTS RHSN ;

While this works in the abstract, this will likely screw up your parser generator if it is LL or LALR based, because now it can't see far enough ahead with just the next token to decide what to do. So you'd have to switch to a more powerful parser generator such as GLR.

A smarter version replaces every terminal T, and only the terminals, with a nonterminal:

      T  =  COMMENTS t ;

and modifies the original lexer to trivially emit t instead of T. You still have lookahead troubles.

But this gives us the basis for a real solution.

A more sophisticated version of this is to have the lexer collect the comments it sees before a token and attach them to the next token it emits; in essence, we are implementing the terminal rule modification of the grammar in the lexer.
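As a rough illustration of that idea, here is a minimal sketch of a post-lexing pass that buffers comments and hangs them on the next real token. The token shape (`type`, `value`), the `'Comment'` and `'EOF'` type names, and the function name `attachLeadingComments` are all assumptions made for this example; they are not reflect.js's actual token format or API.

```javascript
// Hypothetical token shape: { type: string, value: string }.
// Comments are assumed to arrive as tokens with type 'Comment'.
function attachLeadingComments(rawTokens) {
  const out = [];
  let pending = []; // comments seen since the last non-comment token

  for (const tok of rawTokens) {
    if (tok.type === 'Comment') {
      pending.push(tok);
    } else {
      // Emit the real token, annotated with the comments that preceded it.
      out.push({ ...tok, leadingComments: pending });
      pending = [];
    }
  }

  // Comments after the last real token have no "next" token, so they are
  // attached to a synthetic end-of-input token instead of being dropped.
  out.push({ type: 'EOF', value: '', leadingComments: pending });
  return out;
}
```

The parser then receives the same token sequence it always did (plus whatever end-of-input marker it expects), and semantic actions can copy `leadingComments` from a token onto the AST node they build.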

Now the parser (you don't have to switch technologies) just sees the tokens it originally saw; the tokens carry the comments as annotations. You'll find it useful to divide comments into those that attach to the previous token, and those that attach to the next, but you won't be able to make this any better than a heuristic, because there is no practical way to decide to which token the comments really belong.
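One common heuristic, sketched below, treats a comment that starts on the same line as the previous token as belonging to that token, and everything else as belonging to the next token. This is only one possible rule, and the token fields (`type`, `value`, `line`) and the name `splitComments` are hypothetical. The sketch also assumes every token sits on a single line.

```javascript
// Hypothetical token shape: { type: string, value: string, line: number }.
function splitComments(rawTokens) {
  const out = [];
  let pending = []; // comments waiting for the next real token

  for (const tok of rawTokens) {
    if (tok.type !== 'Comment') {
      out.push({ ...tok, leadingComments: pending, trailingComments: [] });
      pending = [];
      continue;
    }
    const prev = out[out.length - 1];
    // Same line as the previous token, and nothing already queued for the
    // next token: guess that the comment describes the previous token.
    if (prev && pending.length === 0 && tok.line === prev.line) {
      prev.trailingComments.push(tok);
    } else {
      pending.push(tok);
    }
  }

  // Comments left over at end of input belong to no token.
  return { tokens: out, dangling: pending };
}
```

For `x // about x` followed on the next line by `// about y` and then `y`, the first comment lands in `trailingComments` of `x` and the second in `leadingComments` of `y`. Cases such as a comment sitting between two tokens on the same line will still be guessed wrong some of the time.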

You'll find it fun to figure out how to capture the positioning information on the tokens and the comments, to enable regeneration of the original text ("comments in their proper locations"). You'll find it more fun to actually regenerate the text with appropriate radix values, character string escapes, etc., in a way that doesn't break the language syntax rules.
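To give a feel for the positioning part, here is a small sketch that rebuilds text from tokens and their attached comments using recorded line and column numbers. The field names (`line`, `column`, `value`, `leadingComments`, `trailingComments`) and the function `regenerate` are assumptions for illustration. It assumes 1-based lines, 0-based columns, single-line tokens and comments, and spaces rather than tabs. It keeps each token's original source text in `value`, which avoids re-deriving radixes and string escapes only for tokens that were not changed.

```javascript
// Hypothetical piece shape: { value: string, line: number, column: number }.
// Tokens may also carry leadingComments / trailingComments arrays of pieces.
function regenerate(tokens) {
  // Flatten tokens and their attached comments into one list of pieces.
  const pieces = [];
  for (const tok of tokens) {
    pieces.push(...(tok.leadingComments || []), tok, ...(tok.trailingComments || []));
  }
  // Order by source position so attachment order does not matter.
  pieces.sort((a, b) => a.line - b.line || a.column - b.column);

  let text = '';
  let line = 1;
  let column = 0;
  for (const p of pieces) {
    // Emit newlines until we reach the piece's line.
    while (line < p.line) {
      text += '\n';
      line += 1;
      column = 0;
    }
    // Pad with spaces until we reach the piece's column.
    if (column < p.column) {
      text += ' '.repeat(p.column - column);
      column = p.column;
    }
    text += p.value;
    column += p.value.length;
  }
  return text;
}
```

Once a transformation inserts, deletes, or rewrites tokens, the recorded positions no longer fit together, and that is where most of the real work starts.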

We do this with our general language processing tools and it works reasonably well. It is amazing how much work it is to get it all straight, so that you can focus on your transformation task. People underestimate this a lot.

Ira Baxter answered Oct 27 '22 11:10