How to match any text in ANTLRv4? I mean text, which is unknown at the time of grammar writing?
My grammar is as follows:
grammar Anytext;
line :
comment;
comment : '#' anytext;
anytext: ANY*;
WS : [ \t\r\n]+;
ANY : .;
And my code is as follows:
String line = "# This_is_a_comment";
ANTLRInputStream input = new ANTLRInputStream(line);
AnytextLexer lexer = new AnytextLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
AnytextParser parser = new AnytextParser(tokens);
ParseTree tree = parser.comment();
System.out.println(tree.toStringTree(parser)); // print LISP-style tree
Output follows:
line 1:1 extraneous input ' ' expecting {<EOF>, ANY}
(comment # (anytext T h i s _ i s _ a _ c o m m e n t))
If I change the ANY rule to
ANY : [ \t\r\n.];
it stops recognizing any symbol at all.
UPDATE 1
There is no end-of-line character at the end of my input.
UPDATE 2
So, as I understand it, it is impossible to match arbitrary text in the lexer, because the lexer cannot assign the same character to several token classes. If I define a lexer rule that matches any symbol, it will either hide all the other rules or not work at all.
But the question persists.
How do I match all symbols at the parser level, then?
Suppose I have table-shaped data and I want to process some fields and ignore others. If I had an anytext rule, I would write
infoline :
( codepoint WS 'field1' WS field1Value ) |
( codepoint WS 'field2' WS field2Value ) |
( codepoint WS anytext );
Here I parse the rows whose second column contains field1 or field2 values, and ignore all other rows.
How can I accomplish this?
It's important to remember that ANTLR will break up your complete input into tokens before the parser ever sees the first token (at least it behaves this way). Your lexer grammar looks like the following.
T__0 : '#'; // implicit token created due to the use of '#' in parser rule comment
WS : [ \t\r\n]+;
ANY : .;
For your input, the tokens are the following:
# (type T__0)
' ' (type WS)
T (type ANY)
h (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
a (type ANY)
_ (type ANY)
c (type ANY)
o (type ANY)
m (type ANY)
m (type ANY)
e (type ANY)
n (type ANY)
t (type ANY)
Your current grammar fails to parse because the WS token isn't allowed in the comment rule. It would parse this input (but may run into problems as you expand your grammar) if you used this:
// remember that '#' is its own token
anytext: (ANY | WS | '#')*;
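To see why the space becomes its own WS token, here is a minimal plain-Java sketch that imitates how the lexer picks tokens: at each position the longest match wins, and a tie goes to the rule defined first. It does not use ANTLR itself; the class LexerSketch and its helpers are hypothetical names made up for this illustration, and the three rules are hard-coded to mirror the lexer grammar above.

```java
import java.util.ArrayList;
import java.util.List;

class LexerSketch {
    // Token types in rule-definition order; an earlier rule wins a tie.
    enum Type { T__0, WS, ANY }

    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    // Length of the longest match of the given rule at pos (0 means no match).
    static int matchLength(Type type, String s, int pos) {
        if (type == Type.T__0) {
            return s.charAt(pos) == '#' ? 1 : 0;
        }
        if (type == Type.WS) {
            int end = pos;
            while (end < s.length() && " \t\r\n".indexOf(s.charAt(end)) >= 0) {
                end++;
            }
            return end - pos;
        }
        return 1; // ANY : . always matches exactly one character
    }

    static List<Token> tokenize(String s) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            Type best = null;
            int bestLen = 0;
            for (Type t : Type.values()) {
                int len = matchLength(t, s, pos);
                // Only a strictly longer match replaces the current best,
                // so equal-length matches stay with the earlier rule.
                if (len > bestLen) {
                    best = t;
                    bestLen = len;
                }
            }
            tokens.add(new Token(best, s.substring(pos, pos + bestLen)));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("# This_is_a_comment")) {
            System.out.println("'" + t.text + "' (type " + t.type + ")");
        }
    }
}
```

The space matches both WS and ANY with length 1, and WS is defined first, so the parser receives a WS token it has no place for in anytext.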
What you could do is change comment to be a lexer rule, which consumes the # character along with whatever follows (in this case, to the end of the line):
grammar Anytext;
line : COMMENT;
COMMENT : '#' ~[\r\n]*;
WS : [ \t\r\n]+;
ANY : .;
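As a rough illustration of what the COMMENT rule consumes, the sketch below uses a java.util.regex pattern as an analogue of '#' ~[\r\n]*. This is only an approximation in plain Java, not the generated lexer; the class CommentRuleSketch and the method commentPrefix are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentRuleSketch {
    // Regex analogue of the lexer rule COMMENT : '#' ~[\r\n]*;
    static final Pattern COMMENT = Pattern.compile("#[^\\r\\n]*");

    // Returns the text a COMMENT token would cover at the start of the
    // input, or null if the input does not start with '#'.
    static String commentPrefix(String input) {
        Matcher m = COMMENT.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(commentPrefix("# This_is_a_comment"));
        System.out.println(commentPrefix("# first\nsecond"));
    }
}
```

Because the whole comment is longer than any single WS or ANY match, the longest-match rule makes it one token, and the line break itself is left for the next token.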
Use the following rule for line comments:
LINE_COMMENT
: '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
;
It matches '#' and every character up to and including the end of the line (Unix or Windows line breaks).
Edit by 280Z28: here is the exact same rule in ANTLR 4 syntax:
LINE_COMMENT
: '#' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN)
;
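One thing to watch: this rule requires a line break after the comment, while the question states that the input has no end-of-line character. The plain-Java sketch below uses regex analogues of the rule to show the difference. It is an approximation rather than ANTLR itself, the class and field names are hypothetical, and making the line break optional is an assumption about one possible adjustment, not something the answers prescribe.

```java
import java.util.regex.Pattern;

class LineCommentSketch {
    // Analogue of '#' ~[\r\n]* '\r'? '\n' : the line break is mandatory.
    static final Pattern WITH_NEWLINE = Pattern.compile("#[^\\r\\n]*\\r?\\n");

    // Same analogue with the line break made optional (assumed variant),
    // so a comment on the last line of the input still matches.
    static final Pattern NEWLINE_OPTIONAL = Pattern.compile("#[^\\r\\n]*(\\r?\\n)?");

    // True if the pattern matches the entire input.
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String noNewline = "# This_is_a_comment";
        System.out.println(matchesWhole(WITH_NEWLINE, noNewline));
        System.out.println(matchesWhole(WITH_NEWLINE, noNewline + "\r\n"));
        System.out.println(matchesWhole(NEWLINE_OPTIONAL, noNewline));
    }
}
```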