 

Is it possible to parse a big file with ANTLR?

Is it possible to instruct ANTLR not to load the entire file into memory? Can it apply rules one by one and generate the topmost list of nodes sequentially as it reads the file? And is it possible to drop already-analyzed nodes somehow?

asked May 08 '13 by Suzan Cioc


2 Answers

Yes, you can use:

  • UnbufferedCharStream for your character stream (passed to lexer)
  • UnbufferedTokenStream for your token stream (passed to parser)
    • This token stream implementation doesn't differentiate between token channels, so make sure to use ->skip instead of ->channel(HIDDEN) in lexer rules whose tokens shouldn't reach the parser.
  • Make sure to call setBuildParseTree(false) on your parser, or a giant parse tree will be created for the entire file. (A minimal wiring sketch follows this list.)
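
For concreteness, here is a minimal wiring sketch of those three points. MyLexer, MyParser, and the start rule file are hypothetical stand-ins for your generated classes; the CommonTokenFactory(true) call makes tokens copy their text eagerly, which is needed because the unbuffered char stream discards characters once consumed.

```java
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;

import java.io.FileInputStream;
import java.io.InputStream;

public class BigFileParse {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            // Char stream that releases characters as they are consumed
            UnbufferedCharStream chars = new UnbufferedCharStream(in);

            MyLexer lexer = new MyLexer(chars); // hypothetical generated lexer
            // Tokens must copy their text, since the stream buffer is transient
            lexer.setTokenFactory(new CommonTokenFactory(true));

            // Token stream that releases tokens as they are consumed
            UnbufferedTokenStream tokens = new UnbufferedTokenStream(lexer);

            MyParser parser = new MyParser(tokens); // hypothetical generated parser
            parser.setBuildParseTree(false); // don't build a whole-file tree
            parser.file(); // hypothetical start rule
        }
    }
}
```

Since no parse tree is built, you would typically react to the input as it is matched, for example via parser.addParseListener(...) or actions embedded in the grammar.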

Edit with some additional commentary:

  • I put quite a bit of work into making sure UnbufferedCharStream and UnbufferedTokenStream operate in the most "sane" manner possible, especially in relation to the mark, release, seek, and getText methods. My goal was to preserve as much of the functionality of those methods as possible without compromising the ability of the stream to release unused memory.
  • ANTLR 4 allows for true unlimited lookahead. If your grammar requires lookahead to EOF to make a decision, then you would not be able to avoid loading the entire input into memory. You'll have to take great care to avoid this situation when writing your grammar.
answered Oct 19 '22 by Sam Harwell


There is a wiki page buried somewhere on antlr.org that speaks to your question; I cannot seem to find it just now.

In substance, the lexer reads data using a standard InputStream interface, specifically ANTLRInputStream.java. The typical implementation is ANTLRFileStream.java, which preemptively reads the entire input data file into memory. What you need to do is write your own buffered version ("ANTLRBufferedFileStream.java") that reads from the source file as needed. Or, just set a standard BufferedInputStream/FileInputStream as the data source to the ANTLRInputStream (see the fragment below).
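
A literal reading of that last suggestion is sketched below ("big.dat" is a hypothetical path). One hedge worth stating: in the stock runtime, ANTLRInputStream's InputStream constructor still reads the whole stream into its internal buffer up front, so truly incremental reading needs either the custom subclass described above or the UnbufferedCharStream from the first answer.

```java
import org.antlr.v4.runtime.ANTLRInputStream;

import java.io.BufferedInputStream;
import java.io.FileInputStream;

// Buffering the underlying file I/O is cheap, but the stock ANTLRInputStream
// still loads everything; a lazy variant would have to override how the
// internal character buffer is filled.
ANTLRInputStream input = new ANTLRInputStream(
        new BufferedInputStream(new FileInputStream("big.dat")));
```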

One caveat is that Antlr4 has the potential for unbounded lookahead. That is unlikely to be a problem with a reasonably sized buffer in normal operation, but more likely when the parser attempts error recovery. Antlr4 allows the error recovery strategy to be tailored, so the problem is manageable; one such tailoring is sketched below.
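
As one hedged example of such tailoring, the runtime's BailErrorStrategy abandons recovery entirely: the parser throws on the first syntax error instead of scanning ahead looking for a place to resynchronize.

```java
import org.antlr.v4.runtime.BailErrorStrategy;
import org.antlr.v4.runtime.misc.ParseCancellationException;

// "parser" is the generated parser from the sketch in the first answer
parser.setErrorHandler(new BailErrorStrategy());
try {
    parser.file(); // hypothetical start rule
} catch (ParseCancellationException e) {
    // No recovery scan, hence no recovery-driven lookahead growth
    System.err.println("Parse aborted at first error: " + e.getCause());
}
```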

Additional detail:

In effect, Antlr implements a pull-parser. When you call the first parser rule, the parser requests tokens from the lexer, which in turn requests character data from the input stream. The parser/lexer interface is implemented by a buffered token stream, nominally BufferedTokenStream. The fragment below illustrates the pull model.
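
You can observe the pull model directly by driving the lexer yourself; this fragment (reusing the hypothetical MyLexer and an InputStream named in) drains one token at a time, and each call pulls only as many characters as the next token needs.

```java
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;

MyLexer lexer = new MyLexer(new UnbufferedCharStream(in));
for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
    // Each nextToken() call requests characters from the stream on demand
}
```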

The parse tree is little more than a tree data structure of tokens. Well, a lot more, but not in terms of data size. Each token is an INT value, typically backed by the fragment of the input data stream that matched the token definition. The lexer itself does not require a full copy of the lexed input character stream to be kept in memory, and the token text fragments could be zeroed out. The critical memory requirement for the lexer is the input character stream lookahead scan, given a buffered file input stream.

Depending on your needs, the in-memory parse tree can be small even given a 100GB+ input file.

To help further, you need to explain more about what you are trying to do in Antlr and what defines your minimum critical memory requirement. That will guide which additional strategies can be recommended. For example, if the source data is amenable, you can use multiple lexer/parser runs, each time sub-selecting a different portion of the source data to process in the lexer. Compared to file reads and DB writes, even with fast disks, Antlr execution will likely be barely noticeable.

answered Oct 19 '22 by GRosenberg