
Extracting specific tags from arbitrary plain text

I want to parse plain text comments and look for certain tags within them. The types of tags I'm looking for look like:

<name#1234>

Where "name" is a [a-z] string (from a fixed list) and "1234" represents a [0-9]+ number. These tags can occur within a string zero or more times and be surrounded by arbitrary other text. For example, the following strings are all valid:

"Hello <foo#56> world!"
"<bar#1>!"
"1 &lt; 2"
"+<baz#99>+<squid#0> and also<baz#99>.\n\nBy the way, maybe <foo#9876>"

The following strings are all NOT valid:

"1 < 2"
"<foo>"
"<bar#>"
"Hello <notinfixedlist#1234>"

The last one isn't valid because "notinfixedlist" isn't a supported named identifier.

I can easily parse this using a simple regex, for example (I'm omitting named groups for simplicity's sake):

<[a-z]+#\d+>

or specifying a fixed list directly:

<(foo|bar|baz|squid)#\d+>

but I'd like to use antlr for a few reasons:

  • I want anything that doesn't match that format to result in a parse error, so if the text contains "<" or ">" but doesn't match the pattern, it fails. Those characters must be escaped as "&lt;" and "&gt;" respectively if it's not a tag.
  • I may extend this in the future to support other kinds of patterns (e.g. "{foo+666}" or "[[@1234]]") and would like to avoid an explosion of regex statements. Having a single grammar file I can extend would be great.
  • I like the fact that antlr4 implements the visitor pattern and my code gets called when a tag of a specific type is encountered instead of having to hack together varying regex.
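
For comparison, the strictness requirement in the first bullet can be approximated with a single regex used as a tokenizer. The following is only a minimal Python sketch, not a replacement for a grammar: the names NAMES, TOKEN, extract_tags and dispatch are made up for illustration, and the fixed list is assumed to be the four names used in the examples above.

```python
import re

# Assumption: the fixed list of supported names, taken from the examples.
NAMES = ("foo", "bar", "baz", "squid")

# Either a well-formed tag, or a run of text containing no '<' or '>'.
TOKEN = re.compile(
    r"<(?P<name>%s)#(?P<id>[0-9]+)>|(?P<text>[^<>]+)" % "|".join(NAMES)
)


def extract_tags(comment):
    """Return [(name, number), ...]; raise ValueError on a stray '<' or '>'."""
    tags = []
    pos = 0
    while pos < len(comment):
        m = TOKEN.match(comment, pos)
        if m is None:
            # Neither alternative matched here, so this is an unescaped
            # '<' or '>' that is not part of a valid tag.
            raise ValueError("unexpected %r at offset %d" % (comment[pos], pos))
        if m.group("name") is not None:
            tags.append((m.group("name"), int(m.group("id"))))
        pos = m.end()
    return tags


def dispatch(comment, handlers):
    """Call handlers[name](number) for each tag, loosely like a visitor."""
    for name, number in extract_tags(comment):
        handlers[name](number)
```

With this sketch, extract_tags("Hello <foo#56> world!") returns [("foo", 56)], while "1 < 2" or "<bar#>" raises ValueError. Each new pattern family would need another alternative in TOKEN, which is the kind of growth the second bullet is trying to avoid.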

How do I implement such a grammar using antlr4? Most of the examples I've seen are for languages which follow exact rules for the entire text, whereas I only want the grammar to apply to matching patterns within arbitrary text.

I've come up with this, which I believe is correct:

grammar Tags;

parse 
    : ( tag | text )*
    ;

tag 
    : '<' fixedlist '#' ID '>'
    ;

fixedlist 
    : 'foo' 
    | 'bar' 
    | 'baz' 
    | 'squid';

text 
    : ~('<' | '>')+
    ;

ID
    : [0-9]+
    ;

Is this correct?

Nick B. asked Oct 18 '22 02:10


1 Answer

In general terms, this is typically described as an island grammar problem: sections of an otherwise single document are described by two or more different, often mutually ambiguous, specifications.

ANTLR 4 directly supports island grammars through the use of lexer modes. Note that modes can only be declared in a lexer grammar, so the lexer and parser must be split into separate grammars.

The parser

parser grammar TagsParser ;

options {
    tokenVocab = TagsLexer ;
}

parse   : ( tag | text )* EOF ;
tag     : LANGLE fixedlist GRIDLET ID RANGLE ;
text    : . ;
fixedlist
    : FOO
    | BAR
    | BAZ
    | SQUID
    ;

The lexer

lexer grammar TagsLexer ;

LANGLE  : '<' -> pushMode(tag) ;
TEXT    : . ;

mode tag ;
    RANGLE  : '>' -> popMode ;

    FOO     : 'foo' ;
    BAR     : 'bar' ;
    BAZ     : 'baz' ;
    SQUID   : 'squid' ;
    GRIDLET : '#' ;
    ID      : [0-9]+ ;

    NONTAG  : . -> popMode ;

The text rule in the parser matches any single token, so it picks up every token that is not consumed as part of a tag. This includes all TEXT tokens as well as any tokens produced by a tag mode rule that do not form a valid tag.
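
To make the mode switching concrete, here is a small Python simulation of the lexer above. It is only an illustrative sketch, not ANTLR-generated code: the RULES table and tokenize function are hypothetical names, and the simulation assumes the usual ANTLR lexer behaviour of taking the longest match at each position and, on a tie, the rule listed first.

```python
import re

# (token type, pattern, action) per mode, in the same order as the lexer grammar.
RULES = {
    "DEFAULT_MODE": [
        ("LANGLE", r"<", "push"),   # '<' -> pushMode(tag)
        ("TEXT", r".", None),
    ],
    "tag": [
        ("RANGLE", r">", "pop"),    # '>' -> popMode
        ("FOO", r"foo", None),
        ("BAR", r"bar", None),
        ("BAZ", r"baz", None),
        ("SQUID", r"squid", None),
        ("GRIDLET", r"#", None),
        ("ID", r"[0-9]+", None),
        ("NONTAG", r".", "pop"),    # anything else -> popMode
    ],
}

# DOTALL so that '.' also matches newlines, like the ANTLR wildcard.
COMPILED = {
    mode: [(name, re.compile(pat, re.DOTALL), action) for name, pat, action in rules]
    for mode, rules in RULES.items()
}


def tokenize(text):
    """Return [(token type, matched text), ...] using a simulated mode stack."""
    modes = ["DEFAULT_MODE"]        # last element is the current mode
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, action in COMPILED[modes[-1]]:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        name, end, action = best    # a '.' rule in each mode always matches
        tokens.append((name, text[pos:end]))
        if action == "push":
            modes.append("tag")
        elif action == "pop":
            modes.pop()
        pos = end
    return tokens
```

In this simulation, tokenize("a<foo#56>b") yields TEXT, LANGLE, FOO, GRIDLET, ID, RANGLE, TEXT, whereas tokenize("1 < 2") yields TEXT, TEXT, LANGLE, NONTAG, TEXT. Because the wildcard text rule accepts LANGLE and NONTAG tokens too, a stray "<" would appear to be parsed as text rather than reported as an error, so stricter handling would need a narrower text rule.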

GRosenberg answered Oct 30 '22 23:10