I have a large XML file (1 GB). I need to run many queries against it (using XPath, for example). Each result is a small part of the XML. I want the queries to be as fast as possible, but the 1 GB file is probably too large to hold in working memory.
The XML looks something like this:
<all>
<record>
<id>1</id>
... lots of fields. (Very different fields per record including (sometimes) subrecords
so mapping on a relational database would be hard).
</record>
<record>
<id>2</id>
... lots of fields.
</record>
.. lots and lots and lots of records
</all>
I need random access, selecting records by, for instance, their id as a key. (The id is the most important key, but other fields might be used as keys too.) I don't know the queries in advance; they arrive and have to be executed as soon as possible, in real time rather than in batch. SAX does not look very promising because I don't want to reread the entire file for every query. But DOM doesn't look promising either, because the file is very large and the extra structural overhead almost certainly means it won't fit in working memory.
Which Java library or approach would best handle this problem?
When handling XML you generally have two approaches: streaming (SAX) or loading the entire document into memory (various DOM implementations).
If you can pre-establish a set of queries to process in bulk, you could write a program that uses SAX to stream the file, looking for matches (see the sketch below). If the queries arrive at random intervals (i.e., like a typical database application), then you will need to either load the entire document into memory or preprocess the XML into a database of some kind.
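For illustration, here is a minimal sketch of the streaming approach using the standard javax.xml.parsers SAX API, assuming the <all>/<record>/<id> layout from your sample. The class name RecordFinder is mine, and the fragment reconstruction is simplified (it ignores attributes and XML escaping):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RecordFinder {

    /** Streams the whole file once and prints every <record> whose <id> equals wantedId. */
    public static void find(File xml, String wantedId) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(xml, new DefaultHandler() {
            private final StringBuilder record = new StringBuilder(); // rough copy of the current record
            private final StringBuilder text = new StringBuilder();   // text of the current element
            private boolean inRecord;
            private boolean matched;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("record".equals(qName)) {
                    inRecord = true;
                    matched = false;
                    record.setLength(0);
                }
                if (inRecord) {
                    record.append('<').append(qName).append('>'); // note: drops attributes, no escaping
                }
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inRecord) {
                    record.append(ch, start, length);
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (inRecord) {
                    record.append("</").append(qName).append('>');
                }
                if ("id".equals(qName) && wantedId.equals(text.toString().trim())) {
                    matched = true;
                }
                if ("record".equals(qName)) {
                    inRecord = false;
                    if (matched) {
                        System.out.println(record); // only the small matching fragment is held in memory
                    }
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        find(new File(args[0]), args[1]); // e.g. java RecordFinder records.xml 2
    }
}

Memory stays flat regardless of file size, but each query is a full pass over the 1 GB file, so this only makes sense if you can batch queries or tolerate a scan per query.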
A better description of what you're trying to accomplish might help get better answers.
VTD-XML is the best fit for your use case: http://vtd-xml.sourceforge.net/
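As a rough sketch of how this could look with VTD-XML (the com.ximpleware classes), assuming the raw document plus VTD's token index still fits in memory once; the file name and the XPath expression are placeholders for your data:

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdLookup {
    public static void main(String[] args) throws Exception {
        // Parse once: the file is loaded and indexed, then reused for every query.
        VTDGen gen = new VTDGen();
        if (!gen.parseFile("records.xml", false)) { // false = namespace-unaware parse
            throw new IllegalStateException("parse failed");
        }
        VTDNav nav = gen.getNav();

        // Each incoming query is just a new XPath evaluated against the in-memory index,
        // so the 1 GB file is not reparsed per query.
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("/all/record[id='2']");
        while (ap.evalXPath() != -1) {
            long fragment = nav.getElementFragment();     // offset and length packed into one long
            int offset = (int) fragment;
            int length = (int) (fragment >> 32);
            System.out.println(nav.toString(offset, length)); // the matching <record> fragment
        }
    }
}

The trade-off is that VTD-XML still keeps the whole document in memory (the raw bytes plus its index), so if even that is too large, you are back to streaming or to preprocessing the records into a database or an index keyed by id, as suggested above.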