Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JFlex match nested comments as one token

In Mathematica a comment starts with (* and ends with *) and comments can be nested. My current approach of scanning a comment with JFlex contains the following code

%xstate IN_COMMENT

"(*"  { yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}

<IN_COMMENT> {
  "(*"        {yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
  [^\*\)\(]*  {return MathematicaElementTypes.COMMENT;}
  "*)"        {yypopstate(); return MathematicaElementTypes.COMMENT;}
  [\*\)\(]    {return MathematicaElementTypes.COMMENT;}
  .           {return MathematicaElementTypes.BAD_CHARACTER;}
}

where the methods yypushstate and yypopstate are defined as

private final LinkedList<Integer> states = new LinkedList();

private void yypushstate(int state) {
    states.addFirst(yystate());
    yybegin(state);
}
private void yypopstate() {
    final int state = states.removeFirst();
    yybegin(state);
}

to give me the opportunity to track how many nested levels of comment I'm dealing with.

Unfortunately, this results in several COMMENT tokens for one comment, because I have to match nested comment starts and comment ends.

Question: Is it possible with JFlex to use its API with methods like yypushback or advance() etc. to return exactly one token over the whole comment range, even if comments are nested?

like image 929
halirutan Avatar asked Jul 10 '14 02:07

halirutan


2 Answers

It seems the bounty was uncalled for as the solution is so simple that I just did not consider it. Let me explain. When scanning a simple nested comment

(* (*..*) *)

I have to track, how many opening comment tokens I see so that I finally, on the last real closing comment can return the whole comment as one token.

What I did not realise was, that JFlex does not need to be told to advance to the next portion when it matches something. After careful review I saw that this is explained here but somewhat hidden in a section I didn't care for:

Because we do not yet return a value to the parser, our scanner proceeds immediately.

Therefore, a rule in flex file like this

[^\(\*\)]+ { }

reads all characters except those that could probably be a comment start/end and does nothing but it advances to the next token.

This means that I can simply do the following. In the YYINITIAL state, I have a rule that matches a beginning comment but it does nothing else then switch the lexer to the IN_COMMENT state. In particular, it does not return any token:

{CommentStart}      { yypushstate(IN_COMMENT);}

Now, we are in the IN_COMMENT state and there, I do the same. I eat up all characters but never return a token. When I hit a new opening comment, I carefully push it onto a stack but do nothing. Only, when I hit the last closing comment, I know I'm leaving the IN_COMMENT state and this is the only point, where I, finally, return the token. Let's look at the rules:

<IN_COMMENT> {
  {CommentStart}  { yypushstate(IN_COMMENT);}
  [^\(\*\)]+      { }
  {CommentEnd}    {  yypopstate();
                     if(yystate() != IN_COMMENT)
                       return MathematicaElementTypes.COMMENT_CONTENT;
                  }
    [\*\)\(]      { }
    .             { return MathematicaElementTypes.BAD_CHARACTER; }
}

That's it. Now, no matter how deep your comment is nested, you will always get one single token that contains the entire comment.

Now, I'm embarrassed and I'm sorry for such a simple question.

Final note

If you are doing something like this, you have to remember that you only return a token from when you hit the correct closing "character". Therefore, you definitely should make a rule that catches the end of file. In IDEA that default behavior is to just return the comment token, so you need another line (useful or not, I want to end gracefully):

    <<EOF>>  { yyclearstack(); yybegin(YYINITIAL);
               return MathematicaElementTypes.COMMENT;}
like image 51
halirutan Avatar answered Dec 16 '22 16:12

halirutan


When I wrote the answer first I had even not realized that one of the existing answers was of the questioner itself. On the other hand, I seldom find a bounty in the rather small SO lex community. Thus, this seemed to me worth to learn enough Java and jflex to write a sample:

/* JFlex scanner: to recognize nested comments in Mathematica style
 */

%%

%{
  /* counter for open (nested) comments */
  int open = 0;
%}

%state IN_COMMENT

%%

/* any state */

"(*" { if (!open++) yybegin(IN_COMMENT); }

"*)" { 
    if (open) {
      if (!--open) {
        yybegin(YYINITIAL);
        return MathematicaElementTypes.COMMENT;
      }
    } else {
      /* or return MathematicaElementTypes.BAD_CHARACTER;
      /* or: throw new Error("'*)' without '(*'!"); */
    }
  }

<IN_COMMENT> {
  . |
  \n { }
}

<<EOF>> {
    if (open) {
      /* This is obsolete if the scanner is instanced new for
       * each invocation.
       */
      open = 0; yybegin(IN_COMMENT);
      /* Notify about syntax error, e.g. */
      throw new Error("Premature end of file! ("
        + open + " open comments not closed.)");
    }
    return MathematicaElementTypes.EOF; /* just a guess */
  }

There might be typos and stupid errors although I tried to be carefully and did my best.

As a "proof of concept" I leave my original implementation here which is done with flex and C/C++.

This scanner

  • handles comment (with printf())
  • echoes everything else.

My solution is essentially based on the fact that flex rules may end with break or return. Therefore, the token is simply not returned until the rule for the pattern is matched closing the outmost comment. Contents in comments is simply "recorded" in a buffer – in my case a std::string. (AFAIK, string is even a built-in type in Java. Therefore, I decided to mix C and C++ which I usally wouldn't.)

My source scan-nested-comments.l:

%{
#include <cstdio>
#include <string>

// counter for open (nested) comments
static int open = 0;
// buffer for collected comments
static std::string comment;
%}

/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap

%s IN_COMMENT

%%

"(*" {
  if (!open++) BEGIN(IN_COMMENT);
  comment += "(*";
}

"*)" {
  if (open) {
    comment += "*)";
    if (!--open) {
      BEGIN(INITIAL);
      printf("EMIT TOKEN COMMENT(lexem: '%s')\n", comment.c_str());
      comment.clear();
    }
  } else {
    printf("ERROR: '*)' without '(*'!\n");
  }
}

<IN_COMMENT>{
  . |
  "\n" { comment += *yytext; }
}

<<EOF>> {
  if (open) {
    printf("ERROR: Premature end of file!\n"
      "(%d open comments not closed.)\n", open);
    return 1;
  }
  return 0;
}

%%

int main(int argc, char **argv)
{
  if (argc > 1) {
    yyin = fopen(argv[1], "r");
    if (!yyin) {
      printf("Cannot open file '%s'!\n", argv[1]);
      return 1;
    }
  } else yyin = stdin;
  return yylex();
}

I compiled it with flex and g++ in cygwin on Windows 10 (64 bit):

$ flex -oscan-nested-comments.cc scan-nested-comments.l ; g++ -o scan-nested-comments scan-nested-comments.cc
scan-nested-comments.cc:398:0: warning: "yywrap" redefined

 ^
scan-nested-comments.cc:74:0: note: this is the location of the previous definition

 ^

$

The warning appears due to %option noyywrap. I guess it does not mean any harm and can be ignored.

Now, I made some tests:

$ cat >good-text.txt <<EOF
> Test for nested comments.
> (* a comment *)
> (* a (* nested *) comment *)
> No comment.
> (* a
> (* nested
> (* multiline *)
>  *)
>  comment *)
> End of file.
> EOF

$ cat good-text | ./scan-nested-comments
Test for nested comments.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')

EMIT TOKEN COMMENT(lexem: '(* a (* nested *) comment *)')

No comment.
EMIT TOKEN COMMENT(lexem: '(* a
(* nested
(* multiline *)
 *)
 comment *)')

End of file.

$ cat >bad-text-1.txt <<EOF
> Test for wrong comment.
> (* a comment *)
> with wrong nesting *)
> End of file.
> EOF

$ cat >bad-text-1.txt | ./scan-nested-comments
Test for wrong comment.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')

with wrong nesting ERROR: '*)' without '(*'!

End of file.

$ cat >bad-text-2.txt <<EOF
> Test for wrong comment.
> (* a comment
> which is not closed.
> End of file.
> EOF

$ cat >bad-text-2.txt | ./scan-nested-comments
Test for wrong comment.
ERROR: Premature end of file!
(1 open comments not closed.)

$
like image 35
Scheff's Cat Avatar answered Dec 16 '22 16:12

Scheff's Cat