Token return values in ANTLR 3 C

Tags: c, antlr

I'm new to ANTLR, and I'm attempting to write a simple parser using the C language target (antlr3C). The grammar is simple enough that I'd like to have each rule return a value, e.g.:

number returns [long value]
 :
 ( INT {$value = $INT.ivalue;}
 | HEX {$value = $HEX.hvalue;}
 ) 
 ; 

HEX returns [long hvalue] 
    : '0' 'x' ('0'..'9'|'a'..'f'|'A'..'F')+  {$hvalue = strtol((char*)$text->chars,NULL,16);}
    ;

INT returns [long ivalue] 
    : '0'..'9'+    {$ivalue = strtol((char*)$text->chars,NULL,10);}
    ;

Each rule collects the return values of its child rules until the topmost rule returns a nice struct full of my data.

As far as I can tell, ANTLR allows lexer rules (tokens, e.g. INT and HEX) to return values just like parser rules (e.g. number). However, the generated C code will not compile:

error C2228: left of '.ivalue' must have class/struct/union
error C2228: left of '.hvalue' must have class/struct/union

I did some poking around, and the errors make sense: the tokens end up as the generic ANTLR3_COMMON_TOKEN_struct, which has no slot for a return value. So maybe the C target just doesn't support this feature. But like I said, I'm new to this, and before I go haring off to find another approach I want to confirm that I can't do it this way.

So the question is this: does antlr3C support return values for lexer rules, and if so, what is the proper way to use them?

John, asked Oct 14 '10

3 Answers

Not really any new information, just some details on what @bemace already mentioned.

No, lexer rules cannot have return values. See section 4.3, Rules, in The Definitive ANTLR Reference:


Rule Arguments and Return Values

Just like function calls, ANTLR parser and tree parser rules can have arguments and return values. ANTLR lexer rules cannot have return values [...]


There are two options:

Option 1

You can do the transforming to a long in the parser rule number:

number returns [long value]
  :  INT {$value = Long.parseLong($INT.text);}
  |  HEX {$value = Long.parseLong($HEX.text.substring(2), 16);}
  ;
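
For the C target, the same transformation should look roughly like the sketch below. This is untested and assumes the stock antlr3 C runtime, where $INT.text in a parser action yields a pANTLR3_STRING whose chars member holds the matched text; it also relies on strtol accepting an optional "0x" prefix when the base is 16:

number returns [long value]
  :  INT {$value = strtol((char *)$INT.text->chars, NULL, 10);}
  |  HEX {$value = strtol((char *)$HEX.text->chars, NULL, 16);}
  ;

HEX  :  '0' 'x' ('0'..'9'|'a'..'f'|'A'..'F')+ ;

INT  :  '0'..'9'+ ;

Note that the lexer rules no longer declare return values; all the conversion happens in the parser rule.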

Option 2

Or create your own token that has, say, a toLong(): long method:

import org.antlr.runtime.*;

public class YourToken extends CommonToken {

  public YourToken(CharStream input, int type, int channel, int start, int stop) {
    super(input, type, channel, start, stop);
  }

  // your custom method
  public long toLong() {
    String text = super.getText();
    int radix = text.startsWith("0x") ? 16 : 10;
    if(radix == 16) text = text.substring(2);
    return Long.parseLong(text, radix);
  }
}

and set TokenLabelType in the options {...} header of your grammar to use this token, and override the emit(): Token method in your lexer class:

grammar Foo;

options{
  TokenLabelType=YourToken;
}

@lexer::members {
  public Token emit() {
    YourToken t = new YourToken(input, state.type, state.channel, 
        state.tokenStartCharIndex, getCharIndex()-1);
    t.setLine(state.tokenStartLine);
    t.setText(state.text);
    t.setCharPositionInLine(state.tokenStartCharPositionInLine);
    emit(t);
    return t;
  }
}

parse
  :  number {System.out.println("parsed: "+$number.value);} EOF
  ;

number returns [long value]
  :  INT {$value = $INT.toLong();}
  |  HEX {$value = $HEX.toLong();}
  ;

HEX
  :  '0' 'x' ('0'..'9'|'a'..'f'|'A'..'F')+
  ;

INT
  :  '0'..'9'+
  ;

When you generate a parser and lexer, and run this test class:

import org.antlr.runtime.*;
import java.io.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("0xCafE");
        FooLexer lexer = new FooLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        FooParser parser = new FooParser(tokens);
        parser.parse();
    }
}

it will produce the following output:

parsed: 51966

The first option seems the more practical one in your case.

Note that, as you can see, the examples given are in Java. I have no idea whether option 2 is supported by the C target/runtime, but I decided to post it anyway so it can serve as a future reference here on SO.

Bart Kiers, answered Nov 01 '22

Lexer rules must return Token objects, because that's what the parser expects to work with. There may be a way to customize the type of token object used, but it's easier just to convert tokens to values in the lowest-level parser rules:

social_title returns [Name.Title title]
  :  SIR    { $title = Name.Title.SIR; }
  |  'Dame' { $title = Name.Title.DAME; }
  |  MR     { $title = Name.Title.MR; }
  |  MS     { $title = Name.Title.MS; }
  |  'Miss' { $title = Name.Title.MISS; }
  |  MRS    { $title = Name.Title.MRS; }
  ;
Brad Mace, answered Nov 01 '22


There is a third option: you can pass an object as an argument to the lexer rule. The object contains a member that represents the lexer rule's return value. Inside the rule you set that member; at the point where you call the rule, you read the member back and do whatever you want with this 'return value'. This style of parameter passing corresponds to 'var' parameters in Pascal or output (reference) parameters in C++ and other programming languages. A rough sketch of the idea follows.
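
In grammar terms, this means declaring an argument on a rule and writing through it from an action. Below is a minimal, untested sketch of that out-parameter style for the C target; the rule name hexOrInt is hypothetical, and the enclosing rule simply passes the address of its own return value:

number returns [long value]
  :  hexOrInt[&$value]
  ;

hexOrInt[long *out]
  :  INT { *$out = strtol((char *)$INT.text->chars, NULL, 10); }
  |  HEX { *$out = strtol((char *)$HEX.text->chars, NULL, 16); }
  ;

I have not verified whether plain (non-fragment) lexer rules accept such arguments in the C runtime; the pattern is most commonly used with parser rules and fragment lexer rules.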

Christian, answered Nov 01 '22