Matching arbitrary text (both symbols and spaces) with ANTLR?

Question

How can I match arbitrary text in ANTLR v4? By that I mean text whose content is unknown at the time the grammar is written.

My grammar is as follows:

grammar Anytext;

line :
    comment;

comment : '#' anytext;

anytext: ANY*;

WS : [ \t\r\n]+;

ANY : .;

And my code is as follows:

    String line = "# This_is_a_comment";

    ANTLRInputStream input = new ANTLRInputStream(line);

    AnytextLexer lexer = new AnytextLexer(input);

    CommonTokenStream tokens = new CommonTokenStream(lexer);

    AnytextParser parser = new AnytextParser(tokens);

    ParseTree tree = parser.comment();

    System.out.println(tree.toStringTree(parser)); // print LISP-style tree

The output is as follows:

line 1:1 extraneous input ' ' expecting {<EOF>, ANY}
(comment # (anytext   T h i s _ i s _ a _ c o m m e n t))

If I change the ANY rule to

ANY : [ \t\r\n.];

it stops recognizing any symbol at all.

UPDATE1

There is no end-of-line character at the end of my input.

UPDATE 2

So, as I understand it, it is impossible to match arbitrary text in the lexer, because the lexer cannot assign the same characters to more than one token type. If I define a lexer rule that matches any symbol, it will either hide all the other rules or not work at all.

But the question persists.

How can I match all symbols at the parser level, then?

Suppose I have table-shaped data and I want to process some fields and ignore others. If I had an anytext rule, I would write

infoline :
    ( codepoint WS 'field1' WS field1Value ) |
    ( codepoint WS 'field2' WS field2Value ) |
    ( codepoint WS anytext );

Here I parse rows whose second column contains field1 or field2, and ignore all other rows.

How can I accomplish this?

Sam Harwell · Accepted Answer

It's important to remember that ANTLR will break up your complete input into tokens before the parser ever sees the first token (at least it behaves this way). Your lexer grammar looks like the following.

T__0 : '#'; // implicit token created due to the use of '#' in parser rule comment

WS : [ \t\r\n]+;

ANY : .;

For your input, the tokens are the following:

# (type T__0)
[space] (type WS)
T (type ANY)
h (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
a (type ANY)
_ (type ANY)
c (type ANY)
o (type ANY)
m (type ANY)
m (type ANY)
e (type ANY)
n (type ANY)
t (type ANY)
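
To make the token list above easier to reproduce without running ANTLR, here is a minimal plain-Java sketch that imitates the three lexer rules by hand. It is not ANTLR-generated code; the class name TokenSketch and the "TYPE:text" output format are made up for illustration. It assumes the usual lexer behaviour: the longest match wins, and on a tie the rule listed first wins.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written imitation of the rules T__0, WS and ANY; NOT ANTLR-generated code.
class TokenSketch {

    // Returns entries such as "T__0:#", "WS: " and "ANY:T", in input order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isWs(c)) {
                // WS : [ \t\r\n]+ consumes the whole whitespace run. For a single
                // space it ties with ANY in length, and WS wins because it is listed first.
                int start = i;
                while (i < input.length() && isWs(input.charAt(i))) {
                    i++;
                }
                tokens.add("WS:" + input.substring(start, i));
            } else if (c == '#') {
                // T__0 and ANY both match one character; T__0 is listed first.
                tokens.add("T__0:#");
                i++;
            } else {
                // ANY : . matches exactly one character.
                tokens.add("ANY:" + c);
                i++;
            }
        }
        return tokens;
    }

    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    public static void main(String[] args) {
        for (String token : tokenize("# This_is_a_comment")) {
            System.out.println(token);
        }
    }
}
```

The point of the sketch is that the space becomes a WS token before the parser is involved at all, so a parser rule that only accepts ANY cannot consume it.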

Your current grammar fails to parse because the WS token isn't allowed in the comment rule. It would parse this input (but may run into problems as you expand your grammar) if you used this:

// remember that '#' is its own token
anytext: (ANY | WS | '#')*;

What you could do is change comment to be a lexer rule, which consumes the # character along with whatever follows (in this case, to the end of the line):

grammar Anytext;

line : COMMENT;

COMMENT : '#' ~[\r\n]*;

WS : [ \t\r\n]+;

ANY : .;
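
As a rough analogy for readers more familiar with regular expressions, the COMMENT rule above behaves much like the pattern #[^\r\n]*. The following is a small sketch using java.util.regex rather than ANTLR; the class and method names are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex analogy for the lexer rule COMMENT : '#' ~[\r\n]*; (not ANTLR code).
class CommentRegexSketch {

    // '#' followed by any run of characters that are not line breaks.
    static final Pattern COMMENT = Pattern.compile("#[^\\r\\n]*");

    // Returns the comment that starts at the beginning of the input, or null if there is none.
    static String leadingComment(String input) {
        Matcher m = COMMENT.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The space and the letters all end up in one match.
        System.out.println(leadingComment("# This_is_a_comment"));
        // The match stops before the line break.
        System.out.println(leadingComment("# first\nsecond"));
    }
}
```

Because the whole comment is consumed as one token, the WS and ANY rules never get a chance to split its contents.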

hoaz · Answer

Use the following rule for line comments:

LINE_COMMENT
    :   '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

It matches '#' and any symbol until it gets to the end of the line (Unix/Windows line breaks).

Edit by 280Z28: here is the exact same rule in ANTLR 4 syntax:

LINE_COMMENT
    :   '#' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN)
    ;
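
One detail worth checking: LINE_COMMENT as written requires a line break after the comment, while the question says the input has no end-of-line character. The sketch below uses the regex #[^\r\n]*\r?\n as a stand-in for the rule to show the difference. It is only an analogy built on java.util.regex, with hypothetical names, and does not run ANTLR.

```java
import java.util.regex.Pattern;

// Regex stand-in for LINE_COMMENT : '#' ~[\r\n]* '\r'? '\n' (not ANTLR code).
class LineCommentRegexSketch {

    // Same shape as the rule: the trailing '\n' is mandatory, '\r' is optional.
    static final Pattern LINE_COMMENT = Pattern.compile("#[^\\r\\n]*\\r?\\n");

    // True if a complete line comment, including its line break, starts the input.
    static boolean matchesAtStart(String input) {
        return LINE_COMMENT.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(matchesAtStart("# comment\n"));   // Unix line break
        System.out.println(matchesAtStart("# comment\r\n")); // Windows line break
        System.out.println(matchesAtStart("# comment"));     // no line break at all
    }
}
```

If the last line of the input may lack a line break, a rule without the mandatory terminator, such as the COMMENT rule from the accepted answer, may be the safer choice.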

Tags:

java

regex

antlr

lexer

antlr4

Suzan Cioc