Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Putting nodes into the parse tree which shouldn't be there

I am writing a parser for a language, and the scanner is designed to

  1. either also return unneeded terminals (e.g. whitespacing) OR
  2. not to do so

based on a boolean flag.

Now, in the parser, I don't want to clutter the grammar with all those terminals, they should be swallowed somehow "automagically" by the parse tree that I'm constructing.

To do this "magic", I thought I would chain the terminals (simply linked circular list) so I could just iterate them and "fill in the blanks" as the reduction happens (I'm using a LALR(1) parser generator).

It sounds like a sane idea, though there is one problem. Remember I said "to either return ... or not"? In scenario (2), I would free the terminal, because who knows what comes next? And I don't want any memory leaks.

But in scenario (1), I cannot free the terminal, because based on them I will decide in further reductions where that "fill in the blanks" process should stop.

I cannot free it conditionally either, because of the same reason: I don't know what comes next. What if there won't be any "fill-in-the-blanks"-process triggered? What if there won't be any further reduction at all?

Have you had similar problems? How have you solved it?

Note: this is all in my mind and I may have not explained it clearly enough, please ask and I'll edit my question. The scenario is actually a bit more complex, I'm not writing this from scratch, where I could use my imagination, I am integrating it into something else, so it may well be that I will answer with "I can't do that because of the environment constraints".


Addendum

The only really good idea that comes to my mind is to fork and improve the parser generator, which I've already done in some minor places here and there, to overcome some of those limitations I was mentioning above.

like image 393
Flavius Avatar asked Aug 07 '11 11:08

Flavius


1 Answers

Your vocabulary is a little odd. Most parsers are designed to recognize the syntax of the langauge. usually the language defintion defines some notion of terminals and explicitly excludes "whitespace", consisting of uninteresting sequnces of text between the text of terminals, often including blanks, tabs, and various kinds of free standing comments. So the word "terminal" used in parsing usually means "those language atoms that are not whitespace". You've defined it implicitly to include whitespace, and I think that is causing your grief.

From this point of view, the easiest way to avoid clutting the grammar definition used by your parser with whitespace, is to simply have the lexer not pass whitespace to the parser. Then your grammar doesn't need to indicate how they are handled (and yes, grammars that do so are really messy), the parser doesn't have to worry about them and they don't show up in the tree.

If you are building a compiler or an interpreter, then ignoring whitespace is easiest.

If you are building a re-engineering parser (see our DMS Software Reengineering Toolkit, then capturing comments (at least) in the AST is important, as eventually one want to regenerate text from consturcted ASTs, and it is helpful if the regenerated text contains the comments, too. [You can do this other ways but they aren't as easy].

The DMS lexer produces "micro"tokens which are your concept of langauge tokens, whitespace, and comments, internally. It throws away whitespace micro-tokens because they simply don't add anything (see above discussion). It passes conventional tokens to the parser, as you'd expect. It glues comment tokens to the preceding or following language token, depending on the token type and where encountered; for C, a /* ... */ seen before a token is attached to it, and a // ... comment is attached to the preceding token (with a few more subtle details not discussed here). Then the parse still only sees language tokens, so the grammar isn't needlessly complicated, and if all the information attached to the token is placed in the tree, the comments go along for the ride.

Now, people often want "Abstract" syntax trees; they want to leave out things like "(" and ")". The scheme I describe above attached comments to even concrete tokens like these. Now there's a complication: if you leave the ( .. ) tokens out of the tree, attached comments vanish. Oops. So DMS parsers do a complicated thing: comments attached to tokens that have logical place in the tree but aren't actually there ("eliminated terminals") are lifted to the parent tree node with an annotation saying they belong on the missing child token. Yes, implementing this is indeed a PITA. The good news is we only had to do it once in DMS's general parsing machinery, and it works for many, many languages. But this means you have to be willing to build an unusual ("reengineering") parser, and we had commercial motivation to do so.

EDIT: It isn't clear why OP wants this, but he insists on capturing whitespace in the tree. Since he hasn't told us why, I'm going to guess: he wants precise column information for the tokens/tree nodes. That isn't hard to do: teach the lexer to keep track of position (line/column), and stamp each token (micro-tokens such as comments, too) with start/end position, and let the parse store that information in the tree. This way avoids keeping whitespace in the tree, too. (DMS does this too, because when reporting problems, precise information is useful, and when regenerating code, placing code back at its original position (at least the same column) is often desirable).

EDIT2: If OP insists on capturing whitespace, he might consider exploring scannerless GLR parsing. This keeps every character in the input stream, including the whitespace.

like image 101
Ira Baxter Avatar answered Sep 23 '22 06:09

Ira Baxter