
Antlr token priority

Tags: java, antlr

I have a rule definition like this:

reference: volume ':' first_page '-' last_page ;

volume: INTEGER;
first_page: INTEGER;
last_page: INTEGER;

INTEGER: [0-9]+;

FREE_TEXT_WORD: NON_SPACE+;

fragment NON_SPACE : ~[ \r\n\t];

Given the input "168:321-331", I expected it to match the reference rule. In reality, the whole string is tokenized as a single FREE_TEXT_WORD.

How can I make the INTEGER token take precedence over FREE_TEXT_WORD in this case?

Thanks.

Wudong asked Aug 21 '13 15:08




1 Answer

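When more than one lexer rule can match at the current position, ANTLR picks the rule that matches the longest input; the order of the rules in the grammar only breaks ties between matches of equal length. Here FREE_TEXT_WORD matches all 11 characters of 168:321-331 while INTEGER matches only 168, so FREE_TEXT_WORD wins. To correct this, you must do one of the following: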

  1. Make sure FREE_TEXT_WORD cannot match more than 3 characters (the length of the INTEGER match 168) at the start of the input 168:321-331, e.g. by not allowing it to contain a digit, or possibly by removing the rule altogether.

    • You could also change FREE_TEXT_WORD to FREE_TEXT_CHARACTER. By limiting the rule to only matching a single character, it will never be longer than another token so its priority will be determined by its position in the grammar. You would then need to create a parser rule for words:

      freeTextWord : FREE_TEXT_CHARACTER+;
      
  2. Move the FREE_TEXT_WORD token into a mode which is not enabled at the point where your input reaches 168:321-331.
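
As an illustration of option 1, here is a minimal sketch of the question's grammar with FREE_TEXT_WORD replaced by a single-character token. The grammar name Reference and the WS rule are assumptions added so the sketch is complete; they are not part of the original grammar.

```antlr
grammar Reference;   // grammar name is an assumption

reference    : volume ':' first_page '-' last_page ;

volume       : INTEGER ;
first_page   : INTEGER ;
last_page    : INTEGER ;

// Words are now assembled in the parser instead of the lexer.
// Digits, ':' and '-' are lexed as their own tokens, so a word
// made only of FREE_TEXT_CHARACTER cannot contain them.
freeTextWord : FREE_TEXT_CHARACTER+ ;

// INTEGER is listed first, so it also wins the tie on a single digit.
INTEGER             : [0-9]+ ;

// Matches exactly one character, so it can never out-match INTEGER.
FREE_TEXT_CHARACTER : ~[ \r\n\t] ;

// Assumption: whitespace between tokens is not significant.
WS                  : [ \t\r\n]+ -> skip ;
```

With these rules the input 168:321-331 should be tokenized as INTEGER ':' INTEGER '-' INTEGER. The literals ':' and '-' become implicit tokens that are defined ahead of the named lexer rules, so they win the one-character tie against FREE_TEXT_CHARACTER.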

Sam Harwell answered Sep 22 '22 13:09