Java - parsing text file - Scanner, Reader or something else?

I'd like to parse a UTF-8 encoded text file that may contain something like this:

int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]

The numbers in the list are separated by commas. Whitespace is permitted but not required between a number and a symbol such as a comma or bracket. The same holds between words and symbols, as in list[.

I handle quoted strings by forcing Scanner to return single characters (setting its delimiter to an empty pattern), because I thought Scanner would still be useful for reading the ints and floats, but I'm no longer sure.

The Scanner always takes a complete token and then tries to match it. What I need is to match as much (or as little) as possible at the current position, disregarding delimiters.

Basically for this input

int list[-45,54, 435 ,-65]

I'd like to be able to call and get this

s.nextWord()   // int 
s.nextWord()   // list
s.nextSymbol() // [
s.nextInt()    // -45
s.nextSymbol() // ,
s.nextInt()    // 54
s.nextSymbol() // ,
s.nextInt()    // 435
s.nextSymbol() // ,
s.nextInt()    // -65
s.nextSymbol() // ]

and so on.

Or, if it couldn't parse doubles and other types itself, at least a method that takes a regex, returns the biggest string that matches it (or an error) and sets the stream position to just after what it matched.
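A minimal sketch of such a method, assuming plain java.util.regex is acceptable instead of Scanner. The class name RegexCursor, its token patterns and the helper methods are hypothetical and only illustrate the idea: Matcher.region() restricts matching to the unread input, and Matcher.lookingAt() requires the match to start exactly at the current position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK: matches regexes at the current position.
final class RegexCursor {
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");
    // Assumed token shapes for the sample input; adjust them to the real format.
    private static final Pattern WORD = Pattern.compile("[A-Za-z_]\\w*");
    private static final Pattern SYMBOL = Pattern.compile("[\\[\\],]");
    private static final Pattern INT = Pattern.compile("[-+]?\\d+");
    private static final Pattern DOUBLE =
            Pattern.compile("[-+]?(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][-+]?\\d+)?");

    private final CharSequence input;
    private int pos = 0;

    RegexCursor(CharSequence input) {
        this.input = input;
    }

    // Returns the index of the first non-whitespace character at or after pos.
    private int skipWhitespace() {
        Matcher ws = WHITESPACE.matcher(input).region(pos, input.length());
        return ws.lookingAt() ? ws.end() : pos;
    }

    // Skips leading whitespace, then returns what the pattern matches right there.
    // The position only advances when the match succeeds.
    String next(Pattern pattern) {
        int start = skipWhitespace();
        Matcher m = pattern.matcher(input).region(start, input.length());
        if (!m.lookingAt()) {
            throw new IllegalStateException(
                    "Expected " + pattern + " at offset " + start);
        }
        pos = m.end();
        return m.group();
    }

    boolean hasMore() {
        return skipWhitespace() < input.length();
    }

    String nextWord()   { return next(WORD); }
    char nextSymbol()   { return next(SYMBOL).charAt(0); }
    int nextInt()       { return Integer.parseInt(next(INT)); }
    double nextDouble() { return Double.parseDouble(next(DOUBLE)); }

    public static void main(String[] args) {
        RegexCursor s = new RegexCursor("int list[-45,54, 435 ,-65]");
        System.out.println(s.nextWord());   // first word
        System.out.println(s.nextWord());   // second word
        System.out.println(s.nextSymbol()); // opening bracket
        System.out.println(s.nextInt());    // first list element
    }
}
```

Because each call names the pattern it expects, delimiters play no role: whitespace is skipped before every token, and a failed match leaves the position unchanged so a different pattern can be tried. The sketch works on text already loaded into memory, so a large file would need to be read in chunks or lines first.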

Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don't seem to be able to find the right tool for it.

Neil asked Apr 01 '26 12:04


1 Answer

I'm not an ANTLR expert, but this ANTLR grammar can parse your input:

grammar Expressions;

expressions 
    :   expression+ EOF
    ;

expression 
    :   intExpression
    |   intListExpression
    |   floatExpression
    |   floatListExpression
    |   textExpression
    |   textListExpression
    ;

intExpression        :  intType INT;
intListExpression    :  intType listType '[' ( INT (',' INT)* )? ']';
floatExpression      :  floatType FLOAT;
floatListExpression  :  floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
textExpression       :  textType STRING;
textListExpression   :  textType listType '[' ( STRING (',' STRING)* )? ']';

intType   :  'int';
floatType :  'float';
textType  :  'text';
listType  :  'list';

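// Optional leading '-' so that negative values such as -45 or -5.2342e+4 are lexed.
INT :   '-'? '0'..'9'+
    ;

FLOAT
    :   '-'? ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
    |   '-'? '.' ('0'..'9')+ EXPONENT?
    |   '-'? ('0'..'9')+ EXPONENT
    ;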

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

Of course you will need to improve it, but I think that with this structure it is easy to insert code into the parser to do what you want (a kind of token stream). Try it in the ANTLRWorks debugger to see what happens.

For your input, this is the parse tree:

[Image: parse tree for the OP's input]

Edit: I changed it to support empty lists.

davidbuzatto answered Apr 03 '26 01:04


