
Matching arbitrary text (both symbols and spaces) with ANTLR?

How can I match arbitrary text in ANTLR v4? By that I mean text whose content is unknown at the time the grammar is written.

My grammar is as follows:

grammar Anytext;

line :
    comment;

comment : '#' anytext;

anytext: ANY*;

WS : [ \t\r\n]+;

ANY : .;

And my code is as follows:

    String line = "# This_is_a_comment";

    ANTLRInputStream input = new ANTLRInputStream(line);

    AnytextLexer lexer = new AnytextLexer(input);

    CommonTokenStream tokens = new CommonTokenStream(lexer);

    AnytextParser parser = new AnytextParser(tokens);

    ParseTree tree = parser.comment();

    System.out.println(tree.toStringTree(parser)); // print LISP-style tree

The output is as follows:

line 1:1 extraneous input ' ' expecting {<EOF>, ANY}
(comment # (anytext   T h i s _ i s _ a _ c o m m e n t))

If I change the ANY rule to

ANY : [ \t\r\n.];

it stops recognizing any symbol at all.

UPDATE 1

My input has no end-of-line character at the end.

UPDATE 2

So I now understand that it is impossible to match arbitrary text in the lexer, since the lexer cannot assign the same character to multiple token classes. If I define a lexer rule that matches any symbol, it either hides all the other rules or doesn't work.

But the question remains.

How do I match all symbols at the parser level, then?

Suppose I have table-shaped data and I want to process some fields and ignore others. If I had an anytext rule, I would write

infoline :
    ( codepoint WS 'field1' WS field1Value ) |
    ( codepoint WS 'field2' WS field2Value ) |
    ( codepoint WS anytext );

Here I parse a row if its 2nd column contains field1 or field2, and ignore the row otherwise.

How can I accomplish this?

Suzan Cioc asked May 11 '13 11:05


2 Answers

It's important to remember that ANTLR will break up your complete input into tokens before the parser ever sees the first token (or at least it behaves as if it does). Including the implicit token, your lexer grammar effectively looks like the following.

T__0 : '#'; // implicit token created due to the use of '#' in parser rule comment

WS : [ \t\r\n]+;

ANY : .;

For your input, the tokens are the following:

  1. # (type T__0)
  2. [space] (type WS)
  3. T (type ANY)
  4. h (type ANY)
  5. i (type ANY)
  6. s (type ANY)
  7. _ (type ANY)
  8. i (type ANY)
  9. s (type ANY)
  10. _ (type ANY)
  11. a (type ANY)
  12. _ (type ANY)
  13. c (type ANY)
  14. o (type ANY)
  15. m (type ANY)
  16. m (type ANY)
  17. e (type ANY)
  18. n (type ANY)
  19. t (type ANY)
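
To see where that list comes from, here is a rough hand-written simulation of the three lexer rules in plain Java. It is only a sketch, not how ANTLR is implemented: the class TokenSketch and its tokenize helper are made up for illustration, and it assumes the usual behavior that the longest match wins and that ties go to the rule defined first (T__0, then WS, then ANY).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; ANTLR generates a real lexer for you.
class TokenSketch {

    // Mirrors WS : [ \t\r\n]+;
    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Splits the input the way the three rules above would, as "TYPE:text" strings.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '#') {
                // T__0 and ANY both match one character; T__0 is defined first.
                tokens.add("T__0:#");
                i++;
            } else if (isWs(c)) {
                // WS consumes the whole run of whitespace (longest match).
                int start = i;
                while (i < input.length() && isWs(input.charAt(i))) {
                    i++;
                }
                tokens.add("WS:" + input.substring(start, i));
            } else {
                // Everything else becomes a one-character ANY token.
                tokens.add("ANY:" + c);
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("# This_is_a_comment");
        System.out.println(tokens.size() + " tokens");
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
```

The point is that the space is already a WS token by the time the parser runs, so a parser rule that accepts only ANY tokens cannot absorb it.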

Your current grammar fails to parse because the WS token isn't allowed in the comment rule. It would parse this input (but may run into problems as you expand your grammar) if you used this:

// remember that '#' is its own token
anytext: (ANY | WS | '#')*;

What you could do is change comment to be a lexer rule, which consumes the # character along with whatever follows (in this case, to the end of the line):

grammar Anytext;

line : COMMENT;

COMMENT : '#' ~[\r\n]*;

WS : [ \t\r\n]+;

ANY : .;
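
If it helps to see what text that COMMENT rule covers, the following sketch uses java.util.regex as a stand-in for the lexer rule. The class name CommentRuleSketch and the helper matchComment are invented for this example, and a regex is only an approximation of how the generated lexer behaves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: a regex analog of  COMMENT : '#' ~[\r\n]*;
class CommentRuleSketch {

    // '#' followed by any number of characters that are not CR or LF.
    static final Pattern COMMENT = Pattern.compile("#[^\\r\\n]*");

    // Returns the text a COMMENT-like token would cover at the start of
    // the input, or null if the input does not start with '#'.
    static String matchComment(String input) {
        Matcher m = COMMENT.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // No trailing newline is needed for the match to succeed.
        System.out.println(matchComment("# This_is_a_comment"));
        // The match stops right before the line break.
        System.out.println(matchComment("# first line\nnext line"));
    }
}
```

Because the whole comment becomes a single token, the parser never has to deal with the spaces or other characters inside it.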
Sam Harwell answered Sep 29 '22 13:09


Use the following rule for line comments (ANTLR 3 syntax):

LINE_COMMENT
    :   '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

It matches '#' and any characters that follow, up to and including the end of the line (Unix or Windows line breaks).

Edit by 280Z28: here is the exact same rule in ANTLR 4 syntax:

LINE_COMMENT
    :   '#' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN)
    ;
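
One thing to watch for: this rule requires a line terminator after the comment, and the question mentions that the input has none. The sketch below uses a java.util.regex pattern as a rough analog of the rule to show the difference. The class LineCommentSketch and the helper matchesLineComment are made up for this example, and the regex only approximates the lexer's behavior.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: a regex analog of
//   LINE_COMMENT : '#' ~[\r\n]* '\r'? '\n' ;
class LineCommentSketch {

    // '#', then non-newline characters, then an optional CR and a mandatory LF.
    static final Pattern LINE_COMMENT = Pattern.compile("#[^\\r\\n]*\\r?\\n");

    // True if the input starts with text the LINE_COMMENT-like pattern accepts.
    static boolean matchesLineComment(String input) {
        return LINE_COMMENT.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        // Input from the question: no line terminator at the end.
        System.out.println(matchesLineComment("# This_is_a_comment"));     // false
        // Unix and Windows line endings are both accepted.
        System.out.println(matchesLineComment("# This_is_a_comment\n"));   // true
        System.out.println(matchesLineComment("# This_is_a_comment\r\n")); // true
    }
}
```

If the last line of the input may lack a newline, one option is to leave the line terminator out of the comment rule, as the COMMENT rule in the other answer does, and let a separate whitespace rule handle the line break.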
hoaz answered Sep 29 '22 11:09