I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: you can only negate single characters. Since your NL
rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T
:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A
lexer-rule in the foo
parser-rule. The foo
rule does now not match any character except the 'a'
, but it matches any lexer rule except A
. In other words, it will only match a 'b'
or 'c'
character.
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, in your grammar, you can already make a distinction between data
- and text
-lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment
part in the lexer rules mean that no tokens are being created from those rules: they are only used in other lexer rules. So the lexer will only create two different type of tokens: DataLine
's and TextLine
's.
Trying to keep your grammar as close as possible, here is how I was able to get it to work based on the example input. Because whitespace is being passed to the parser from the lexer, I did move all your tokens from the parser into actual lexer rules. The main change is really just adding another line option and then trying to get it to match your test data and not the actual other good data, I also assumed that a blank line should be discarded as you can tell by the rule. So here is what I was able to get working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you, good luck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With