
Parsing very large XML documents (and a bit more) in Java

(All of the following is to be written in Java)

I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:

First, the stream will be decrypted according to the aforementioned algorithm.

Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.

Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in part overlap the portion of the document dealt with by the second operation, i.e. I believe I will need to rewind whatever mechanism I am using to deal with this object. (A rough sketch of the two extension points appears below.)
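
For concreteness, here is a hypothetical sketch of what those two extension points might look like. The interface names and method signatures are my own placeholders, not the actual API:

```java
import java.io.InputStream;
import java.util.List;

// Hypothetical extension point for the second phase: reads some
// (unpredictable) portion of the decrypted XML stream.
interface MetadataExtractor {
    void extract(InputStream decryptedXml) throws Exception;
}

// Hypothetical extension point for the third phase: splits the
// decrypted XML into 1..n subset documents.
interface DocumentSplitter {
    List<InputStream> split(InputStream decryptedXml) throws Exception;
}
```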

Here is my question:

Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure whether it's possible to parse XML in the way I'm describing: walking over only as much of the document as is required to gather the second step's information, then rewinding the document and passing over it again to split it into jobs, ideally releasing the parts of the document that are no longer in use after they have been passed.
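
As a minimal sketch of the stream-filter idea (assuming the client's algorithm can be applied as a byte-level transform; decryptByte below is a placeholder, not the real algorithm):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wraps the raw encrypted stream and decrypts bytes as they are read,
// so the XML parser never sees ciphertext and nothing is buffered in full.
public class DecryptingInputStream extends FilterInputStream {

    public DecryptingInputStream(InputStream encrypted) {
        super(encrypted);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return (b == -1) ? -1 : decryptByte(b);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            buf[off + i] = (byte) decryptByte(buf[off + i] & 0xFF);
        }
        return n;
    }

    // Placeholder: substitute the client's actual decryption step here.
    private int decryptByte(int b) {
        return b;
    }
}
```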

asked Dec 10 '08 by Chris R


1 Answer

StAX is the right way. I would recommend looking at Woodstox.
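
A minimal sketch of a two-pass StAX approach, assuming Woodstox (or any other StAX implementation) is on the classpath. Since a forward-only XMLStreamReader cannot be rewound, the placeholder openDecryptedStream() is simply called once per pass:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;
import java.io.InputStream;

public class StreamingPass {

    public static void main(String[] args) throws Exception {
        // Woodstox registers itself as the StAX provider when it is on the classpath.
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // First pass: let the extension read whatever portion of the document it needs.
        try (InputStream in = openDecryptedStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // hand the current element to the extension class here
                }
            }
            reader.close();
        }

        // Second pass: reopen the decrypted stream and split it into subset documents,
        // e.g. by copying events out through an XMLStreamWriter per subset.
        try (InputStream in = openDecryptedStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            // walk the events again and write out the 1..n subset documents
            reader.close();
        }
    }

    // Placeholder: in the real application this would wrap the file
    // in the client's decrypting filter stream.
    private static InputStream openDecryptedStream(String path) throws Exception {
        return new FileInputStream(path);
    }
}
```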

answered Sep 22 '22 by mzehrer