
Random queries on a large xml file

Tags:

java

xml

I have a large XML file (1 GB). I need to run many queries against it (using XPath, for example). Each result is a small part of the XML. I want the queries to be as fast as possible, but the 1 GB file is probably too large for working memory.

The xml looks something like this:

<all>
  <record>
      <id>1</id>
      ... lots of fields. (Very different fields per record including (sometimes) subrecords 
      so mapping on a relational database would be hard).
  </record>
  <record>
      <id>2</id>
      ... lots of fields.
  </record>
  .. lots and lots and lots of records
</all>

I need random access, selecting records using, for instance, the id as a key. (The id is the most important, but other fields might be used as keys too.) I don't know the queries in advance: they arrive one at a time and have to be executed as soon as possible, in real time rather than in batches. SAX does not look very promising because I don't want to reread the entire file for every query. DOM doesn't look very promising either, because the file is very large and the extra structural overhead will almost certainly mean that it won't fit in working memory.

Which Java library or approach would be best suited to this problem?

Jan asked Jul 07 '10 15:07



2 Answers

When handling XML you generally have two approaches: streaming (SAX) or loading the entire document into memory (various DOM implementations).
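To make the streaming side concrete, here is a minimal sketch using the JDK's built-in SAX parser. It makes one pass over the input and collects the text of each `<id>` that sits directly under a `<record>`, without ever holding the whole document in memory. The class and method names (`StreamingScan`, `scanIds`) are made up for illustration, and the depth check assumes the `/all/record/id` layout from the question.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingScan {

    /** Collects the text of every <id> that is a direct child of a <record>. */
    static class IdHandler extends DefaultHandler {
        final List<String> ids = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private int depth = 0;
        private boolean inId = false;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            depth++;
            // Depth 3 corresponds to /all/record/id (assumed layout).
            if (depth == 3 && qName.equals("id")) {
                inId = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so accumulate it.
            if (inId) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (inId && depth == 3) {
                ids.add(text.toString().trim());
                inId = false;
            }
            depth--;
        }
    }

    /** Streams the document once and returns the record ids in document order. */
    static List<String> scanIds(InputStream in) throws Exception {
        IdHandler handler = new IdHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.ids;
    }
}
```

Memory use stays roughly constant regardless of file size, but every call to `scanIds` reads the input from start to finish, which is the cost the question is trying to avoid for ad hoc queries.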

If you can establish a set of queries in advance and process them in bulk, you could write a program that uses SAX to stream the file, looking for matches. If the queries come in at random intervals (as in a typical database application), then you will need to either load the entire document into memory or preprocess the XML document into a database of some kind.
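One lightweight form of the "preprocess" option is an offset index: scan the file once, remember where each record starts and how long it is, and then serve each query with a single seek. The sketch below is only an illustration under several assumptions: `RecordIndex` and `fetch` are hypothetical names, records are delimited by exactly `<record>` and `</record>` (no attributes, no nested `<record>` elements), the file is UTF-8, those tag strings do not appear inside comments or CDATA, and the first `<id>` inside a record is the record's own id.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordIndex {
    private static final byte[] OPEN = "<record>".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "</record>".getBytes(StandardCharsets.US_ASCII);
    private static final Pattern ID = Pattern.compile("<id>\\s*(.*?)\\s*</id>", Pattern.DOTALL);

    private final Path file;
    // id -> {byte offset of "<record>", length of the record in bytes}
    private final Map<String, long[]> spans = new HashMap<>();

    /** Scans the file once; only one record is buffered in memory at a time. */
    public RecordIndex(Path file) throws IOException {
        this.file = file;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream record = new ByteArrayOutputStream();
            boolean inside = false;   // currently between <record> and </record>?
            int matched = 0;          // bytes of the current target tag matched so far
            long pos = 0, start = 0;  // pos = number of bytes consumed
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (inside) record.write(b);
                byte[] target = inside ? CLOSE : OPEN;
                // Both tags start with '<' and contain no other '<',
                // so a mismatch can restart at 1 if this byte is '<'.
                matched = (b == target[matched]) ? matched + 1 : (b == '<' ? 1 : 0);
                if (matched < target.length) continue;
                matched = 0;
                if (!inside) {
                    inside = true;
                    start = pos - OPEN.length;
                    record.reset();
                    record.write(OPEN, 0, OPEN.length);
                } else {
                    inside = false;
                    String xml = new String(record.toByteArray(), StandardCharsets.UTF_8);
                    Matcher m = ID.matcher(xml);
                    if (m.find()) spans.put(m.group(1), new long[] {start, pos - start});
                }
            }
        }
    }

    /** Returns the raw XML of the record with this id, or null if there is none. */
    public String fetch(String id) throws IOException {
        long[] span = spans.get(id);
        if (span == null) return null;
        byte[] buf = new byte[(int) span[1]];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(span[0]);
            raf.readFully(buf);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```

After `new RecordIndex(path)` has run once, `fetch("2")` reads only that record's bytes, and the returned fragment is small enough to hand to a DOM parser or an XPath evaluator. Other key fields could get their own maps in the same pass. A real database or a proper XML index would be more robust than this byte scan; the sketch is only meant to show the shape of the idea.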

A better description of what you're trying to accomplish might help get better answers.

Jim Garrison answered Oct 02 '22 15:10



VTD-XML is the best fit for your use case: http://vtd-xml.sourceforge.net/

Aravind Yarram answered Oct 02 '22 14:10
