I want to parse plain text comments and look for certain tags within them. The types of tags I'm looking for look like:
<name#1234>
Where "name" is a [a-z] string (from a fixed list) and "1234" represents a [0-9]+ number. These tags can occur within a string zero or more times and be surrounded by arbitrary other text. For example, the following strings are all valid:
"Hello <foo#56> world!"
"<bar#1>!"
"1 < 2"
"+<baz#99>+<squid#0> and also<baz#99>.\n\nBy the way, maybe <foo#9876>"
The following strings are all NOT valid:
"1 < 2"
"<foo>"
"<bar#>"
"Hello <notinfixedlist#1234>"
The last one isn't valid because "notinfixedlist" isn't a supported named identifier.
I can easily parse this using a simple regex, for example (I'm omitting named groups for simplicity's sake):
<[a-z]+#\d+>
or specifying a fixed list directly:
<(foo|bar|baz|squid)#\d+>
but I'd like to use antlr for a few reasons:
How do I implement such a grammar using antlr4? Most of the examples I've seen are for languages which follow exact rules for the entire text, whereas I only want the grammar to apply to matching patterns within arbitrary text.
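For reference, the regex baseline mentioned above can be sketched in Python as follows. This is only an illustration of the fixed-list pattern, not part of the ANTLR solution; the names `TAG_RE` and `find_tags` are made up for this sketch, and `[0-9]+` is used in place of `\d+` so that only ASCII digits match.

```python
import re

# Fixed list of supported names, taken from the examples in the question.
TAG_RE = re.compile(r"<(?P<name>foo|bar|baz|squid)#(?P<id>[0-9]+)>")


def find_tags(text):
    """Return a (name, id) pair for every well-formed tag found in text."""
    return [(m.group("name"), int(m.group("id"))) for m in TAG_RE.finditer(text)]


if __name__ == "__main__":
    print(find_tags("Hello <foo#56> world!"))
    print(find_tags("Hello <notinfixedlist#1234>"))
```

Anything that does not match the full pattern, such as `<foo>` or `<bar#>`, is simply skipped as ordinary text.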
I've come up with this, which I believe is correct:
grammar Tags;
parse
: ( tag | text )*
;
tag
: '<' fixedlist '#' ID '>'
;
fixedlist
: 'foo'
| 'bar'
| 'baz'
| 'squid';
text
: ~('<' | '>')+
;
ID
: [0-9]+
;
Is this correct?
In general terms, the problem identified is typically described as an island grammar problem, where sections of an otherwise singular document are described by two or more different, often mutually ambiguous, specifications.
ANTLR 4 directly supports island grammars through the use of modes. Note that modes are only available in split lexer/parser grammars.
The parser
parser grammar TagsParser ;
options {
tokenVocab = TagsLexer ;
}
parse : ( tag | text )* EOF ;
tag : LANGLE fixedlist GRIDLET ID RANGLE ;
text : . ;
fixedlist
: FOO
| BAR
| BAZ
| SQUID
;
The lexer
lexer grammar TagsLexer ;
LANGLE : '<' -> pushMode(tag) ;
TEXT : . ;
mode tag ;
RANGLE : '>' -> popMode ;
FOO : 'foo' ;
BAR : 'bar' ;
BAZ : 'baz' ;
SQUID : 'squid' ;
GRIDLET : '#' ;
ID : [0-9]+ ;
NONTAG : . -> popMode ;
The text rule in the parser will match all tokens not previously consumed by the parser rules above it. This will include all TEXT tokens as well as any text that happens to be matched by a tag mode rule but is not validly part of a tag.
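To make the mode switching easier to follow, here is a rough Python simulation of the lexer above. It is not ANTLR's implementation and not generated code, only a hand-written sketch that mirrors the rules: the longest match wins, ties go to the rule listed first, and `pushMode`/`popMode` manipulate a mode stack. The names `DEFAULT_MODE`, `TAG_MODE`, and `tokenize` are invented for this illustration.

```python
import re

# Each mode is a list of (token type, pattern, action), in grammar order.
DEFAULT_MODE = [
    ("LANGLE", re.compile(r"<"), "push"),       # '<' -> pushMode(tag)
    ("TEXT", re.compile(r".", re.S), None),     # any single character
]
TAG_MODE = [
    ("RANGLE", re.compile(r">"), "pop"),        # '>' -> popMode
    ("FOO", re.compile(r"foo"), None),
    ("BAR", re.compile(r"bar"), None),
    ("BAZ", re.compile(r"baz"), None),
    ("SQUID", re.compile(r"squid"), None),
    ("GRIDLET", re.compile(r"#"), None),
    ("ID", re.compile(r"[0-9]+"), None),
    ("NONTAG", re.compile(r".", re.S), "pop"),  # anything else -> popMode
]


def tokenize(text):
    """Return (token type, matched text) pairs, switching modes like the lexer."""
    stack = [DEFAULT_MODE]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, action in stack[-1]:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        # Both modes end with a catch-all rule, so a match always exists.
        name, m, action = best
        tokens.append((name, m.group()))
        pos = m.end()
        if action == "push":
            stack.append(TAG_MODE)
        elif action == "pop":
            stack.pop()
    return tokens


if __name__ == "__main__":
    for token in tokenize("1 < 2 <foo#56>"):
        print(token)
```

In this sketch a stray `<`, as in "1 < 2", produces a LANGLE token followed by a NONTAG token that drops back to the default mode. Those are the leftover tokens that the catch-all text rule in the parser is meant to absorb.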