
How can my ANTLR lexer match a token made of characters that are subset of another kind of token?

Tags:

grammar

antlr

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:

expr 
    : special_ident
    | ident
    ;

special_ident : LETTER DIGIT;
ident         : LETTER (LETTER | DIGIT)*;

LETTER : 'A'..'Z';
DIGIT  : '0'..'9';

When I try to check this grammar, I get this warning:

Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2. As a result, alternative(s) 2 were disabled for that input

I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.

Here's some sample input and what I'd like it to match:

A      : ident
A1     : special_ident
A1A    : ident
A12    : ident
AA1    : ident

How can I form my grammar such that I correctly identify my two types of identifiers?

Chris Farmer asked Jan 31 '10 21:01



2 Answers

Seems that you have 3 cases:

  • A
  • AN
  • A(A|N)(A|N)+

You could classify the middle one as special_ident and the other two as ident; that should do the trick.

I'm a bit rusty with ANTLR, so I hope this hint is enough. I can try to write out the expressions for you, but they could be wrong:

long_ident    : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident         : LETTER | long_ident;
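
One way to sanity-check these three cases without running ANTLR is to mirror them as regular expressions. This is only a sketch: the class name ThreeCases and the classify helper are made up for illustration, and the regexes merely stand in for the grammar rules. Note that a two-letter input such as AA matches none of the three patterns as written.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any ANTLR API: regex mirrors of the three cases.
final class ThreeCases {
    static final Pattern SINGLE  = Pattern.compile("[A-Z]");                 // A
    static final Pattern SPECIAL = Pattern.compile("[A-Z][0-9]");            // AN
    static final Pattern LONG    = Pattern.compile("[A-Z][A-Z0-9][A-Z0-9]+"); // A(A|N)(A|N)+

    // Returns the rule name the input would fall under, or "no match".
    static String classify(String s) {
        if (SPECIAL.matcher(s).matches()) {
            return "special_ident";
        }
        if (SINGLE.matcher(s).matches() || LONG.matcher(s).matches()) {
            return "ident";
        }
        return "no match";
    }

    public static void main(String[] args) {
        // Sample inputs from the question, plus the two-letter input AA.
        String[] samples = {"A", "A1", "A1A", "A12", "AA1", "AA"};
        for (String s : samples) {
            System.out.println(s + " : " + classify(s));
        }
    }
}
```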
Carl Smotricz answered Nov 10 '22 15:11



Expanding on Carl's thought, I would guess you have four different cases:

  1. A
  2. AN
  3. AA(A|N)*
  4. AN(A|N)+

Only case 2 should be a special_ident; the other three should be ident. All of them can be identified by syntax alone. Here is a quick grammar that I tested in ANTLRWorks, and it appeared to work properly for me. I think Carl's version might have one bug when checking AA, but getting you 99% of the way there is a huge benefit, so this is only a minor modification to his quick thought.

prog 
    :    (expr WS)+ EOF;

expr 
    : special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
    | ident {System.out.println("Found ident:" + $ident.text + "\n");}
    ;

special_ident : LETTER DIGIT;

ident
    : LETTER
    | LETTER DIGIT (LETTER | DIGIT)+
    | LETTER LETTER (LETTER | DIGIT)*
    ;

LETTER : 'A'..'Z';
DIGIT  : '0'..'9';
WS 
    :   (' '|'\t'|'\n'|'\r')+;
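
To cross-check the four cases against the sample inputs from the question, a plain Java stand-in can help. This is a minimal sketch and not generated ANTLR code: IdentClassifier and its methods are hypothetical names, and it assumes only 'A'..'Z' and '0'..'9' are valid characters, as in the LETTER and DIGIT rules.

```java
// Hypothetical stand-in for the grammar above; names are illustrative only.
final class IdentClassifier {
    static boolean isLetter(char c) { return c >= 'A' && c <= 'Z'; }
    static boolean isDigit(char c)  { return c >= '0' && c <= '9'; }

    // Returns "special_ident", "ident", or "no match" for a single word.
    static String classify(String s) {
        // Every identifier must start with a letter.
        if (s.isEmpty() || !isLetter(s.charAt(0))) {
            return "no match";
        }
        // All remaining characters must be letters or digits.
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isLetter(c) && !isDigit(c)) {
                return "no match";
            }
        }
        // Case 2: exactly one letter followed by exactly one digit.
        if (s.length() == 2 && isDigit(s.charAt(1))) {
            return "special_ident";
        }
        // Cases 1, 3 and 4: everything else that passed the checks above.
        return "ident";
    }

    public static void main(String[] args) {
        String[] samples = {"A", "A1", "A1A", "A12", "AA1", "AA"};
        for (String s : samples) {
            System.out.println(s + " : " + classify(s));
        }
    }
}
```

With this split, AA falls under case 3 and is reported as ident, while A1 is the only sample reported as special_ident.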
WayneH answered Nov 10 '22 14:11
