Using Antlr for parsing data from never-ending stream

Is ANTLR suitable for parsing data from streams that don't have an EOF right after the text to parse? From what I have observed, the lexer does not emit the current token until the first character of the next token is received. On top of that, the parser does not seem to emit the rule until the first token of the next rule is received. Here is a simple grammar I tried:

fox: 'quick' 'brown' 'fox' '\r'? '\n' ;

Then I used the generated parser with UnbufferedCharStream and UnbufferedTokenStream:

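  // `is` is assumed to be the InputStream that delivers the never-ending data
  CharStream input = new UnbufferedCharStream(is);
  MyLexer lex = new MyLexer(input);
  // Unbuffered char streams need tokens that copy their own text
  lex.setTokenFactory(new CommonTokenFactory(true));
  TokenStream tokens = new UnbufferedTokenStream(lex);
  MyParser parser = new MyParser(tokens);
  MyParser.FoxContext fox = parser.fox();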

When the stream gets 'quick', nothing happens.

When 'b' comes in, the parser enters rule 'fox'.

Then 'rown': nothing (2 tokens are in the stream, and neither of them is known to the lexer yet!)

Only after 'f' does the listener visit the first token: 'quick'.

Then nothing on 'ox'.

On a new line (Unix), the listener visits token 'brown'.

Now the stream has all the data (4 tokens), but only 2 tokens are recognized.

I found that to push those tokens through the system, the stream has to deliver 2 more tokens, and any tokens known to the grammar will do. It could be 2 extra new lines, or, say, 'fox' and 'brown'. Only then do the tokens 'fox' and '\n' get visited, the parser exits rule 'fox', and parsing finishes.

Is that a bug or a feature? Is there a way to eliminate that lag?

Thanks!

AndreyP asked Feb 13 '13 22:02

1 Answer

The ANTLR 4 book was originally going to contain an example of parsing a streaming input, but I argued against it due to the severe complications that will inevitably arise from the use of an adaptive unlimited lookahead parser for something like this.

ANTLR 4 has no guaranteed lookahead bound (and no way to tell it to look for or even attempt to enforce one), so any implementation that operates on a blocking stream has the possibility of deadlock without returning information about the parse leading up to that point. I wouldn't even entertain the possibility of parsing a streaming input unless I saw an intermediate buffer in place first.

  1. Take all available (or previously unparsed) input and place it in a String or char[].
  2. Create an ANTLRInputStream for the buffer.
  3. Attempt to lex/parse this stream, which will have an implicit EOF on the end.
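Step 1 could look roughly like the sketch below. It is only an illustration using plain java.io: the class name AvailableInput and the method drain are hypothetical, and it assumes the input is wrapped in a Reader whose ready() reports whether a read would block.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper: moves whatever can be read without blocking into a buffer. */
final class AvailableInput {
    private AvailableInput() {}

    /**
     * Appends every character that is ready on `source` to `buffer` and
     * returns how many characters were appended. Stops at end of stream
     * or as soon as a read might block.
     */
    static int drain(Reader source, StringBuilder buffer) {
        int appended = 0;
        try {
            while (source.ready()) {
                int c = source.read();
                if (c < 0) {
                    break; // end of stream reached
                }
                buffer.append((char) c);
                appended++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return appended;
    }
}
```

The buffer's contents (buffer.toString()) would then be handed to step 2. Because ready() only says that the next read will not block, a drained chunk may well end in the middle of a token, which is the case the retry logic is meant to cover.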

The result of the parse will tell you whether to discard the results to that point, or hold on to them to retry when more data is available:

  • If no syntax error occurs, the input was successfully parsed, and you can parse the next section of input when it becomes available later.

  • If a syntax error is reported before the EOF token is consumed, then a syntax error appears in the actual input so you'll want to handle it (report it to the user, etc...).

  • If a syntax error is reported at the point where the EOF token is consumed then additional input may resolve the problem - ignore the results of the current parse, and then retry once more data is available from the input stream.
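The three cases above can be sketched as a small retry loop. This is not ANTLR code: to keep it self-contained, a hand-written matcher for the question's fox rule stands in for the generated parser, and the names BufferAndRetry, Outcome, attempt and feed are all hypothetical. With a real ANTLR parser, the classification would presumably come from an error listener that checks whether the offending token is EOF; that wiring is not shown here.

```java
/** Sketch of buffer-and-retry; a hand-written matcher stands in for the ANTLR parse. */
final class BufferAndRetry {
    enum Outcome { PARSED, SYNTAX_ERROR, NEED_MORE_INPUT }

    // Stand-in for the `fox` rule: the three literals, then an optional '\r' and a '\n'.
    private static final String LITERALS = "quickbrownfox";

    private final StringBuilder pending = new StringBuilder(); // unparsed input so far
    private int consumed; // length of the prefix matched by the last successful attempt

    /** Tries to match the buffered text, treating the end of the buffer as an implicit EOF. */
    private Outcome attempt() {
        String text = pending.toString();
        int i = 0;
        while (i < LITERALS.length()) {
            if (i == text.length()) {
                return Outcome.NEED_MORE_INPUT; // failed exactly at EOF: more data may fix it
            }
            if (text.charAt(i) != LITERALS.charAt(i)) {
                return Outcome.SYNTAX_ERROR; // failed before EOF: the input itself is wrong
            }
            i++;
        }
        if (i < text.length() && text.charAt(i) == '\r') {
            i++; // optional carriage return
        }
        if (i == text.length()) {
            return Outcome.NEED_MORE_INPUT;
        }
        if (text.charAt(i) != '\n') {
            return Outcome.SYNTAX_ERROR;
        }
        consumed = i + 1;
        return Outcome.PARSED;
    }

    /** Adds newly arrived input and re-parses everything that is still pending. */
    Outcome feed(String chunk) {
        pending.append(chunk);
        Outcome outcome = attempt();
        if (outcome == Outcome.PARSED) {
            pending.delete(0, consumed); // keep only input after the parsed section
        } else if (outcome == Outcome.SYNTAX_ERROR) {
            pending.setLength(0); // assumed recovery policy: drop the bad input
        }
        return outcome; // NEED_MORE_INPUT leaves the buffer untouched for the retry
    }
}
```

Feeding the chunks from the question ('quick', 'b', 'rown', 'f', 'ox') yields NEED_MORE_INPUT each time, and the final '\n' yields PARSED with no extra tokens required. The cost is that every attempt re-parses the pending text from its start, which seems acceptable only when the unparsed sections stay small.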

Sam Harwell answered Sep 24 '22 03:09