Hello everyone,
This could be a follow-up on this question: Antlr rule priorities
I'm trying to write an ANTLR grammar for the reStructuredText markup language.
The main problem I'm facing is : "How to match any sequence of characters (regular text) without masking other grammar rules?"
Let's take an example with a paragraph with inline markup:
In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element
*before* the insert point. The variable ``after_ptr`` points to the element *after* the
insert. In other words, we are going to put our new element **in between** ``before_ptr``
and ``after_ptr``.
I thought that writing rules for inline markup text would be easy. So I wrote a simple grammar:
grammar Rst;
options {
output=AST;
language=Java;
backtrack=true;
//memoize=true;
}
@members {
boolean inInlineMarkup = false;
}
// PARSER
text
: inline_markup (WS? inline_markup)* WS? EOF
;
inline_markup
@after {
inInlineMarkup = false;
}
: {!inInlineMarkup}? (emphasis|strong|litteral|link)
;
emphasis
@init {
inInlineMarkup = true;
}
: '*' (~'*')+ '*' {System.out.println("emphasis: " + $text);}
;
strong
@init {
inInlineMarkup = true;
}
: '**' (~'*')+ '**' {System.out.println("bold: " + $text);}
;
litteral
@init {
inInlineMarkup = true;
}
: '``' (~'`')+ '``' {System.out.println("litteral: " + $text);}
;
link
@init {
inInlineMarkup = true;
}
: inline_internal_target
| footnote_reference
| hyperlink_reference
;
inline_internal_target
: '_`' (~'`')+ '`' {System.out.println("inline_internal_target: " + $text);}
;
footnote_reference
: '[' (~']')+ ']_' {System.out.println("footnote_reference: " + $text);}
;
hyperlink_reference
: ~(' '|'\t'|'\u000C'|'_')+ '_' {System.out.println("hyperlink_reference: " + $text);}
| '`' (~'`')+ '`_' {System.out.println("hyperlink_reference (long): " + $text);}
;
// LEXER
WS
: (' '|'\t'|'\u000C')+
;
NEWLINE
: '\r'? '\n'
;
This simple grammar doesn't work. And I didn't even try to match regular text...
My questions:
Thanks in advance for your help :-)
Robin
Thank you very much for your help! I would have had a hard time figuring my errors... I'm not writing that grammar (only) to learn ANTLR, I'm trying to code an IDE plugin for eclipse. And for that, I need a grammar ;)
I managed to go further in the grammar and wrote a text
rule:
grammar Rst;
options {
output=AST;
language=Java;
}
@members {
boolean inInlineMarkup = false;
}
//////////////////
// PARSER RULES //
//////////////////
file
: line* EOF
;
line
: text* NEWLINE
;
text
: inline_markup
| normal_text
;
inline_markup
@after {
inInlineMarkup = false;
}
: {!inInlineMarkup}? {inInlineMarkup = true;}
(
| STRONG
| EMPHASIS
| LITTERAL
| INTERPRETED_TEXT
| SUBSTITUTION_REFERENCE
| link
)
;
link
: INLINE_INTERNAL_TARGET
| FOOTNOTE_REFERENCE
| HYPERLINK_REFERENCE
;
normal_text
: {!inInlineMarkup}?
~(EMPHASIS
|SUBSTITUTION_REFERENCE
|STRONG
|LITTERAL
|INTERPRETED_TEXT
|INLINE_INTERNAL_TARGET
|FOOTNOTE_REFERENCE
|HYPERLINK_REFERENCE
|NEWLINE
)
;
//////////////////
// LEXER TOKENS //
//////////////////
EMPHASIS
: STAR ANY_BUT_STAR+ STAR {System.out.println("EMPHASIS: " + $text);}
;
SUBSTITUTION_REFERENCE
: PIPE ANY_BUT_PIPE+ PIPE {System.out.println("SUBST_REF: " + $text);}
;
STRONG
: STAR STAR ANY_BUT_STAR+ STAR STAR {System.out.println("STRONG: " + $text);}
;
LITTERAL
: BACKTICK BACKTICK ANY_BUT_BACKTICK+ BACKTICK BACKTICK {System.out.println("LITTERAL: " + $text);}
;
INTERPRETED_TEXT
: BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("LITTERAL: " + $text);}
;
INLINE_INTERNAL_TARGET
: UNDERSCORE BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("INLINE_INTERNAL_TARGET: " + $text);}
;
FOOTNOTE_REFERENCE
: L_BRACKET ANY_BUT_BRACKET+ R_BRACKET UNDERSCORE {System.out.println("FOOTNOTE_REFERENCE: " + $text);}
;
HYPERLINK_REFERENCE
: BACKTICK ANY_BUT_BACKTICK+ BACKTICK UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (long): " + $text);}
| ANY_BUT_ENDLINK+ UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (short): " + $text);}
;
WS
: (' '|'\t')+ {$channel=HIDDEN;}
;
NEWLINE
: '\r'? '\n' {$channel=HIDDEN;}
;
///////////////
// FRAGMENTS //
///////////////
fragment ANY_BUT_PIPE
: ESC PIPE
| ~(PIPE|'\n'|'\r')
;
fragment ANY_BUT_BRACKET
: ESC R_BRACKET
| ~(R_BRACKET|'\n'|'\r')
;
fragment ANY_BUT_STAR
: ESC STAR
| ~(STAR|'\n'|'\r')
;
fragment ANY_BUT_BACKTICK
: ESC BACKTICK
| ~(BACKTICK|'\n'|'\r')
;
fragment ANY_BUT_ENDLINK
: ~(UNDERSCORE|' '|'\t'|'\n'|'\r')
;
fragment ESC
: '\\'
;
fragment STAR
: '*'
;
fragment BACKTICK
: '`'
;
fragment PIPE
: '|'
;
fragment L_BRACKET
: '['
;
fragment R_BRACKET
: ']'
;
fragment UNDERSCORE
: '_'
;
The grammar is working fine for inline_markup but normal_text is not matched.
Here is my test class:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.Tree;
public class Test {
public static void main(String[] args) throws RecognitionException, IOException {
InputStream is = Test.class.getResourceAsStream("test.rst");
Reader r = new InputStreamReader(is);
StringBuilder source = new StringBuilder();
char[] buffer = new char[1024];
int readLenght = 0;
while ((readLenght = r.read(buffer)) > 0) {
if (readLenght < buffer.length) {
source.append(buffer, 0, readLenght);
} else {
source.append(buffer);
}
}
r.close();
System.out.println(source.toString());
ANTLRStringStream in = new ANTLRStringStream(source.toString());
RstLexer lexer = new RstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RstParser parser = new RstParser(tokens);
RstParser.file_return out = parser.file();
System.out.println(((Tree)out.getTree()).toStringTree());
}
}
And the input file I use:
In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element
*before* the insert point. The variable ``after_ptr`` points to the |element| *after* the
insert. In other words, `we are going`_ to put_ our new element **in between** ``before_ptr``
and ``after_ptr``.
And I get this output:
HYPERLINK_REFERENCE (short): 7-6`_
line 1:2 mismatched character ' ' expecting '_'
line 1:10 mismatched character ' ' expecting '_'
line 1:18 mismatched character ' ' expecting '_'
line 1:21 mismatched character ' ' expecting '_'
line 1:26 mismatched character ' ' expecting '_'
line 1:37 mismatched character ' ' expecting '_'
LITTERAL: `before_ptr`
line 1:86 no viable alternative at character '\r'
line 1:55 mismatched character ' ' expecting '_'
line 1:60 mismatched character ' ' expecting '_'
line 1:63 mismatched character ' ' expecting '_'
line 1:70 mismatched character ' ' expecting '_'
line 1:73 mismatched character ' ' expecting '_'
line 1:77 mismatched character ' ' expecting '_'
line 1:85 mismatched character ' ' expecting '_'
EMPHASIS: *before*
line 2:12 mismatched character ' ' expecting '_'
line 2:19 mismatched character ' ' expecting '_'
line 2:26 mismatched character ' ' expecting '_'
LITTERAL: `after_ptr`
line 2:30 mismatched character ' ' expecting '_'
line 2:39 mismatched character ' ' expecting '_'
line 2:90 no viable alternative at character '\r'
line 2:60 mismatched character ' ' expecting '_'
line 2:63 mismatched character ' ' expecting '_'
line 2:67 mismatched character ' ' expecting '_'
line 2:77 mismatched character ' ' expecting '_'
line 2:85 mismatched character ' ' expecting '_'
line 2:89 mismatched character ' ' expecting '_'
line 3:7 mismatched character ' ' expecting '_'
line 3:10 mismatched character ' ' expecting '_'
line 3:16 mismatched character ' ' expecting '_'
line 3:23 mismatched character ' ' expecting '_'
line 3:27 mismatched character ' ' expecting '_'
line 3:31 mismatched character ' ' expecting '_'
line 3:42 mismatched character ' ' expecting '_'
line 3:51 mismatched character ' ' expecting '_'
line 3:55 mismatched character ' ' expecting '_'
line 3:63 mismatched character ' ' expecting '_'
line 3:94 mismatched character '\r' expecting '*'
line 4:3 mismatched character ' ' expecting '_'
line 4:18 no viable alternative at character '\r'
line 4:18 mismatched character '\r' expecting '_'
HYPERLINK_REFERENCE (short): oing`_
HYPERLINK_REFERENCE (short): ut_
EMPHASIS: *in between*
LITTERAL: `after_ptr`
BR.recoverFromMismatchedToken
line 0:-1 mismatched input '<EOF>' expecting NEWLINE
null
Can you point to my error(s)? (the parser works for inline markup without errors when I add the filter=true; option to the grammar)
Robin
Here's a quick demo how you could parse this reStructeredText. Note that it just handles a minor set of all available markup-syntax, and by adding more to it, you will affect the existing parser/lexer rules: so there is much, much more work to be done!
grammar RST;
options {
output=AST;
backtrack=true;
memoize=true;
}
tokens {
ROOT;
PARAGRAPH;
INDENTATION;
LINE;
WORD;
BOLD;
ITALIC;
INTERPRETED_TEXT;
INLINE_LITERAL;
REFERENCE;
}
parse
: paragraph+ EOF -> ^(ROOT paragraph+)
;
paragraph
: line+ -> ^(PARAGRAPH line+)
| Space* LineBreak -> /* omit line-breaks between paragraphs from AST */
;
line
: indentation text+ LineBreak -> ^(LINE text+)
;
indentation
: Space* -> ^(INDENTATION Space*)
;
text
: styledText
| interpretedText
| inlineLiteral
| reference
| Space
| Star
| EscapeSequence
| Any
;
styledText
: bold
| italic
;
bold
: Star Star boldAtom+ Star Star -> ^(BOLD boldAtom+)
;
italic
: Star italicAtom+ Star -> ^(ITALIC italicAtom+)
;
boldAtom
: ~(Star | LineBreak)
| italic
;
italicAtom
: ~(Star | LineBreak)
| bold
;
interpretedText
: BackTick interpretedTextAtoms BackTick -> ^(INTERPRETED_TEXT interpretedTextAtoms)
;
interpretedTextAtoms
: ~BackTick+
;
inlineLiteral
: BackTick BackTick inlineLiteralAtoms BackTick BackTick -> ^(INLINE_LITERAL inlineLiteralAtoms)
;
inlineLiteralAtoms
: inlineLiteralAtom+
;
inlineLiteralAtom
: ~BackTick
| BackTick ~BackTick
;
reference
: Any+ UnderScore -> ^(REFERENCE Any+)
;
UnderScore
: '_'
;
BackTick
: '`'
;
Star
: '*'
;
Space
: ' '
| '\t'
;
EscapeSequence
: '\\' ('\\' | '*')
;
LineBreak
: '\r'? '\n'
| '\r'
;
Any
: .
;
When you generate a parser and lexer from the above, and let it parse the following input file:
***x*** **yyy** *zz* * a b c P2 ``*a*`b`` `q` Python_
(note the trailing line break!)
the parser will produce the following AST:
The graph can be created by running this class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"***x*** **yyy** *zz* *\n" +
"a b c\n" +
"\n" +
"P2 ``*a*`b`` `q`\n" +
"Python_\n";
RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source));
RSTParser parser = new RSTParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
or if your source comes from a file, do:
RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst"));
or
RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst", "???"));
where "???"
is the encoding of your file.
The class above will print the AST as a DOT file to the console. You can use a DOT viewer to display the AST. In this case, I posted an image created by kgraphviewer. But there are many more viewers around. A nice online one is this one, which appears to be using kgraphviewer under "the hood". Good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With