ANTLR grammar how to capture all characters to end of line

Tags: c#, antlr

I'm trying to parse a command that looks like _SC play Piano 1 into a tree with three nodes: "_SC", "play", and "Piano 1".

The grammar I've got so far is:

grammar PBScript;
options {
output = AST;
language = CSharp2;
}

line    :       COMMAND WS ACTION;
COMMAND :   '_SC';
ACTION  :   'play';
WS  :   (' '|'\t')+ ;

When I create another rule to capture the "Piano 1" part like so:

grammar PBScript;
options {
output = AST;
language = CSharp2;
}

line    :       COMMAND WS ACTION WS PARAMETER;
COMMAND :   '_SC';
ACTION  :   'play';
WS  :   (' '|'\t')+;
PARAMETER
    :       (~('\n'|'\r'))+ ;

I get a MismatchedTokenException(6!=5). I understand that the grammar is wrong, and I partly know why: it's ambiguous because WS overlaps PARAMETER. I just don't know how to fix it.

There are other commands besides _SC, and PARAMETER should be optional. Eventually there will also be a different line type that looks like Name: blah blah blah, where I'll need at least "Name" and "blah blah blah" in the tree, in case that matters. Right now, though, I'm just trying to figure out what to use for PARAMETER.

~Tom

EDIT: The string "Piano 1" should be any string of non-newline characters, i.e. everything from the first non-whitespace character after play to the end of the line.

majinnaibu asked Feb 22 '23 07:02


2 Answers

You can't use a PARAMETER rule like that in your lexer. ANTLR's lexer matches tokens greedily, so PARAMETER would gobble up the entire line and no COMMAND or ACTION tokens would ever be created.
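To make the problem concrete, here is a minimal Python sketch of a toy longest-match tokenizer using the four lexer rules from the question. It is only a model: ANTLR's lexer uses its own prediction mechanism, and the `RULES` table and `longest_match_tokens` function are hypothetical names invented for this illustration.

```python
import re

# Toy versions of the question's lexer rules (illustrative only, not ANTLR).
RULES = [
    ("COMMAND", re.compile(r"_SC")),
    ("ACTION", re.compile(r"play")),
    ("WS", re.compile(r"[ \t]+")),
    ("PARAMETER", re.compile(r"[^\r\n]+")),
]

def longest_match_tokens(text):
    """Tokenize by always taking the rule with the longest match at the cursor."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Keep the longest match; earlier rules win ties.
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens

print(longest_match_tokens("_SC play Piano 1"))
# prints [('PARAMETER', '_SC play Piano 1')]
```

In this model the whole line becomes a single PARAMETER token, so a parser rule expecting COMMAND first has nothing to match.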

To be able to match something to the end of the line, you'd need a parser rule for it. But then the parser must have a notion of what a new line is (i.e. the lexer will need to produce new-line tokens).

grammar T;

options {
  output=AST;
}

tokens {
  LINE;
  PARAMS;
}

line
 : COMMAND ACTION rest_of_line NL 
   -> ^(LINE COMMAND ACTION ^(PARAMS rest_of_line))
 ;

rest_of_line
 : ~NL* // match any token other than a line break zero or more times
 ;

COMMAND : '_SC';
ACTION  : 'play';
WORD    : ('a'..'z' | 'A'..'Z')+;
NUMBER  : '0'..'9';
WS      : (' '|'\t')+ {skip();};
NL      : '\r'? '\n' | '\r';

If you now parse the input "_SC play Piano 1" (followed by a line break, since the line rule requires an NL token), you'd end up with the following AST:

[Image: AST diagram, not preserved]
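As a rough cross-check of the tree shape implied by the rewrite rule ^(LINE COMMAND ACTION ^(PARAMS rest_of_line)), the following Python sketch imitates the grammar by hand. It is not generated by ANTLR; `TOKEN_RE`, `tokenize`, and `parse_line` are hypothetical names, and the nested-tuple tree is just a convenient stand-in for an AST.

```python
import re

# Token rules modelled on the grammar above; WS is skipped, NL is kept.
TOKEN_RE = re.compile(
    r"(?P<COMMAND>_SC\b)|(?P<ACTION>play\b)|(?P<WORD>[A-Za-z]+)"
    r"|(?P<NUMBER>[0-9])|(?P<WS>[ \t]+)|(?P<NL>\r?\n|\r)"
)

def tokenize(text):
    """Return (type, text) pairs, dropping whitespace like {skip();} does."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_line(text):
    """Imitate: line : COMMAND ACTION rest_of_line NL."""
    tokens = tokenize(text)
    if len(tokens) < 3 or tokens[0][0] != "COMMAND" or tokens[1][0] != "ACTION":
        raise ValueError("expected COMMAND ACTION ... NL")
    if tokens[-1][0] != "NL":
        raise ValueError("line must end with a line break")
    rest = tokens[2:-1]  # rest_of_line: any tokens except NL
    if any(kind == "NL" for kind, _ in rest):
        raise ValueError("only one line is supported in this sketch")
    return ("LINE", tokens[0][1], tokens[1][1],
            ("PARAMS", *[value for _, value in rest]))

print(parse_line("_SC play Piano 1\n"))
# prints ('LINE', '_SC', 'play', ('PARAMS', 'Piano', '1'))
```

Note that "Piano" and "1" arrive as separate children of PARAMS. Because WS is skipped, the spacing between them is not stored in the tree, so recovering the exact text "Piano 1" would need extra work (for example, keeping whitespace on a hidden channel instead of skipping it).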

Bart Kiers answered Mar 03 '23 22:03


This grammar will parse your _SC play Piano 1 statement, provided each line is prefixed with command: as in the sample input further down:

grammar PBScript;
options {
language = CSharp2;
output=AST;
}
tokens
{
COMMAND;
ACTION;
PARAM;
}

program :   lines;

lines   :   line*;

line:   'command:' command  action parameter param_modifier 
    ;

command
    :   IDENTIFIER
    ->  ^(COMMAND IDENTIFIER)
    ;

action  :   IDENTIFIER
    ->      ^(ACTION IDENTIFIER)
    ;

parameter   :   IDENTIFIER
    ->  ^(PARAM IDENTIFIER)
    ;

param_modifier  :   INTEGER
    ;

IDENTIFIER  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

INTEGER :   '0'..'9'+ 
    ;


COMMENT
    :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

Then for the input:

command: _SC play Piano 1

command: _SR doSomething someInstrument 2

You will get the following parse tree:

[Image: parse tree diagram, not preserved]

Then, when you write your AST (tree) grammar, you should check the name of each command and act on it, for example: if the command name is _SC, do one thing; if it is _SR, do another; and so on.
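One way to organize that check is a lookup table from command name to handler. The Python sketch below is only an illustration of the idea: the handler functions, the `HANDLERS` table, and the flat (command, action, parameter, modifier) arguments are all assumptions, not something produced by the grammar above.

```python
def handle_sc(action, parameter, modifier):
    # Hypothetical handler for _SC commands.
    return f"_SC -> {action} {parameter} {modifier}"

def handle_sr(action, parameter, modifier):
    # Hypothetical handler for _SR commands.
    return f"_SR -> {action} {parameter} {modifier}"

# Map each command name to the function that handles it.
HANDLERS = {
    "_SC": handle_sc,
    "_SR": handle_sr,
}

def dispatch(command, action, parameter, modifier):
    """Look up the handler for a command name and call it."""
    handler = HANDLERS.get(command)
    if handler is None:
        raise ValueError(f"unknown command: {command}")
    return handler(action, parameter, modifier)

print(dispatch("_SC", "play", "Piano", 1))
# prints _SC -> play Piano 1
```

Adding a new command then only means adding one entry to the table rather than another branch in an if/else chain.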

vldmrrdjcc answered Mar 03 '23 21:03