Characters Matching Multiple Lexer Rules in ANTLR

Tags: parsing, antlr, lexer

I've defined multiple lexer rules that can potentially match the same character sequence. For example:

LBRACE:  '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*'  ;
AMPERSAND: '&'  ;

IGNORED_SYMBOLS:   ('!' | '#' | '%' | '^' | '-' | '+' | '=' | 
                    '\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/'  ) ;


// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};

STRING_LITERAL:  '"' (STR_ESC | ~( '"' ))* '"'; 
fragment STR_ESC:  '\\'  '"'  ; 

CHAR_LITERAL :  '\'' (CH_ESC | ~( '\'' )) '\''  ;  
fragment CH_ESC :  '\\' '\'';

My IGNORED_SYMBOLS rule matches / and ", and my ASTERISK rule matches *. Since they're placed (unintentionally) before my comment and string literal rules, which also start with / and ", I expected the comment and string literal rules to be effectively disabled. But surprisingly, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.

This is somewhat confusing. Shouldn't a /, whether it is part of /* or standing alone, always be matched and consumed by IGNORED_SYMBOLS before ML_COMMENT has any chance to match it?

How does the lexer decide which rule to apply when the characters match more than one rule?

757

asked Sep 24 '11 12:09

JavaMan

1 Answer

How does the lexer decide which rule to apply when the characters match more than one rule?

Lexer rules are tried in the order they are defined, from top to bottom. If two (or more) rules match the same number of characters, the rule defined first takes precedence over the one(s) defined later in the grammar. If one rule matches N characters and a later rule matches those same N characters plus one or more additional characters, the later rule wins (greedy match). That is why /* ... */ is tokenized by ML_COMMENT, which matches the whole comment, rather than by IGNORED_SYMBOLS, which matches only the leading /.

Take the following rules for example:

DO : 'do';
ID : 'a'..'z'+;

The input "do" would obviously be matched by the rule DO.

Input like "done", however, would be greedily matched by ID as a single token. It is not tokenized as two tokens: [DO:"do"] followed by [ID:"ne"].
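
The two rules above (longest match wins, and ties go to the rule defined first) can be modelled with a short Python sketch. This is only a simplified illustration of the behaviour described in this answer, not how ANTLR is implemented internally. The `RULES` table and the `tokenize` function are hypothetical names, and the regular expressions are rough stand-ins for a few of the rules from the question's grammar.

```python
import re

# Hypothetical rule table, listed in "grammar order": earlier entries win ties.
# The patterns approximate a few rules from the question; they are not ANTLR syntax.
RULES = [
    ("DO", re.compile(r"do")),
    ("IGNORED_SYMBOLS", re.compile("[" + re.escape("!#%^-+=\\|:\"'<>,.?/") + "]")),
    # Non-greedy body, so the comment ends at the first "*/".
    ("ML_COMMENT", re.compile(r"/\*.*?\*/", re.DOTALL)),
    ("ID", re.compile(r"[a-z]+")),
    ("WS", re.compile(r"[ \n\r\t\f]+")),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs using longest-match-first."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match replaces the current best;
            # on equal length the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With this model, `tokenize("do")` yields a single DO token, `tokenize("done")` yields a single ID token, and `tokenize("/* x */")` yields a single ML_COMMENT token even though IGNORED_SYMBOLS is listed earlier and also matches the leading `/`.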

121

answered Sep 28 '22 07:09

Bart Kiers
