
Can I use antlr to parse partial data?

Tags: antlr

I am trying to use ANTLR to parse a log file. Because I am only interested in part of the log, I want to write a partial parser that processes only the important parts.

For example, I want to parse this segment:

[ 123 begin ]

So I wrote the grammar:

log :   
    '[' INT 'begin' ']'
    ;


INT : '0'..'9'+
    ;


NEWLINE
    : '\r'? '\n'
    ;

WS
    : (' '|'\t')+ {skip();}
    ;

But the segment may appear in the middle of a line, for example:

 111 [ 123 begin ] 222

From the discussion "What is the wrong with the simple ANTLR grammar?", I understand why my grammar can't process the line above.

Is there any way to make ANTLR ignore the errors and continue processing the remaining text?

Thanks for any advice! Leon

Leon Chen asked Nov 04 '12 14:11




1 Answer

Since a '[' may also need to be skipped when it appears outside of a [ 123 begin ] segment, this can't be handled in the lexer alone. You'll have to create a parser rule that matches the token(s) to be skipped (see the noise rule).

You'll also need a fall-through lexer rule that matches any single character when none of the other lexer rules match (see the ANY rule).

A quick demo:

grammar T;

parse
    : ( log {System.out.println("log=" + $log.text);}
      | noise
      )*
      EOF
    ;

log : OBRACK INT BEGIN CBRACK
    ;

noise
    : ~OBRACK                  // any token except '['
    | OBRACK ~INT              // a '[' followed by any token except an INT
    | OBRACK INT ~BEGIN        // a '[', an INT and any token except a BEGIN
    | OBRACK INT BEGIN ~CBRACK // a '[', an INT, a BEGIN and any token except ']'
    ;

BEGIN   : 'begin';
OBRACK  : '[';
CBRACK  : ']';
INT     : '0'..'9'+;
NEWLINE : '\r'? '\n';
WS      : (' '|'\t')+ {skip();};
ANY     : .;
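
To make the log-or-noise idea easier to follow, here is a hand-written Java sketch of the same strategy. It does not use ANTLR or any generated classes: the class name PartialLogScanner, the regex-based tokenizer, and the choice to return only the INT text are all assumptions made for illustration. It also skips noise one token at a time, so it may not behave identically to the noise rule above in every edge case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated lexer/parser; not ANTLR output.
final class PartialLogScanner {
    // Token kinds in the same order as the lexer rules of grammar T.
    private static final String[] KINDS =
        {"BEGIN", "OBRACK", "CBRACK", "INT", "NEWLINE", "WS", "ANY"};

    // One named group per token kind; ANY is the fall-through for a single character.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<BEGIN>begin)|(?<OBRACK>\\[)|(?<CBRACK>\\])|(?<INT>[0-9]+)"
        + "|(?<NEWLINE>\\r?\\n)|(?<WS>[ \\t]+)|(?<ANY>.)",
        Pattern.DOTALL);

    // Splits the input into {kind, text} pairs, dropping WS like skip() does.
    static List<String[]> tokenize(String input) {
        List<String[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    if (!kind.equals("WS")) {
                        tokens.add(new String[] {kind, m.group()});
                    }
                    break;
                }
            }
        }
        return tokens;
    }

    // Returns the INT text of every OBRACK INT BEGIN CBRACK sequence found.
    static List<String> findLogs(String input) {
        List<String[]> tokens = tokenize(input);
        String[] want = {"OBRACK", "INT", "BEGIN", "CBRACK"};
        List<String> ints = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean match = i + want.length <= tokens.size();
            for (int k = 0; match && k < want.length; k++) {
                match = tokens.get(i + k)[0].equals(want[k]);
            }
            if (match) {
                ints.add(tokens.get(i + 1)[1]); // the INT inside the brackets
                i += want.length;
            } else {
                i++; // treat the current token as noise and move on
            }
        }
        return ints;
    }

    public static void main(String[] args) {
        System.out.println(findLogs("111 [ 123 begin ] 222")); // prints [123]
    }
}
```

The scanner has the same two ingredients as the grammar: a catch-all token kind, so that lexing never fails, and a loop that either recognizes a complete log segment or discards input and carries on.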

Bart Kiers answered Nov 17 '22 11:11