 

Java lazy reading of XML file?

Tags: java, xml

I am wondering how I can lazily read a large XML file that doesn't fit into memory in Java. Let's assume the file is correctly formatted and we don't have to make a first pass to check this. Does someone know how to do this in Java?

Here is my fake file (the real file is a Wikipedia dump, which is 50+ GB):

<pages>
  <page>
    <text> some data ....... </text>
  </page>
  <page>
    <text> MORE DATA ........ </text>
  </page>
</pages>

I was trying this with an XML library that is supposed to be able to do this but it's loading the whole thing into memory >:O

DOMParser domParser = new DOMParser();
//This is supposed to make it lazy-load the file, but it's not working
domParser.setFeature("http://apache.org/xml/features/dom/defer-node-expansion", true);
//Library says this needs to be set to use defer-node-expansion
domParser.setProperty("http://apache.org/xml/properties/dom/document-class-name", "org.apache.xerces.dom.DocumentImpl");

//THIS IS LOADING THE WHOLE FILE
domParser.parse(new InputSource(wikiXMLBufferedReader));

Document doc = domParser.getDocument();
NodeList pages = doc.getElementsByTagName("page");

for(int i = 0; i < pages.getLength(); i++) {
    Node pageNode = pages.item(i);
    //do something with page nodes
}

Does anyone know how to do this? Or what am I doing wrong in my attempt with this particular Java XML library?

Thanks.

anthonybell asked Oct 28 '25 06:10


2 Answers

You should be looking at SAX parsers in Java. DOM parsers are built to read the entire XML document, load it into memory, and create Java objects out of it. SAX parsers read the XML file serially and use an event-based mechanism to process the data, so only the part you are currently handling needs to be in memory. Look at the differences here.

Here's a link to a SAX tutorial. Hope it helps.
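As a rough illustration of the event-based approach, here is a minimal sketch using the JDK's built-in SAX parser. It assumes the `<pages>/<page>/<text>` layout from the question's fake file; the class name `PageTextReader`, the method `readPages`, and the `onPage` callback are made up for this example. Only the text of the page currently being read is buffered.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: names are illustrative, not from any library.
final class PageTextReader {

    /** Streams the XML and passes the text of each <page> to onPage when that page ends. */
    static void readPages(InputStream in, Consumer<String> onPage) throws Exception {
        // The default factory is not namespace-aware, so qName holds the element name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("page".equals(qName)) {
                    text.setLength(0);          // start a fresh buffer for this page
                } else if ("text".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append.
                if (inText) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("text".equals(qName)) {
                    inText = false;
                } else if ("page".equals(qName)) {
                    onPage.accept(text.toString());  // hand the finished page to the caller
                }
            }
        });
    }
}
```

It could be called as `PageTextReader.readPages(new FileInputStream(path), page -> process(page))`, where `path` and `process` stand in for your own file and per-page logic.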

Vinay Rao answered Oct 29 '25 19:10


If you're prepared to buy a Saxon-EE license, then you can issue the simple query "copy-of(//page)", with execution options set to enable streaming, and it will return you an iterator over a sequence of trees each rooted at a page element; each of the trees will be fetched when you advance the iterator, and will be garbage-collected when you have finished with it. (That's assuming you really want to do the processing in Java; you could also do the processing in XQuery or XSLT, of course, which would probably save you many lines of code.)

If you have more time than money, and want a home-brew solution, then write a SAX filter which accepts parsing events from the XML parser and sends them on to a DocumentBuilder; every time you hit a startElement event for a page element, open a new DocumentBuilder; when the corresponding endElement event is notified, grab the tree that has been built by the DocumentBuilder, and pass it to your Java application for processing.
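The home-brew approach might look roughly like the sketch below. Note one substitution: the JAXP `javax.xml.parsers.DocumentBuilder` does not accept SAX events directly, so this sketch builds each small tree with an identity `TransformerHandler` writing into a `DOMResult` instead. The class name `PageSplitter`, the `split` method, and the `onPage` callback are invented for illustration, and it assumes pages are elements whose local name is `page`.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: builds one small DOM tree per <page> and then lets it go.
final class PageSplitter extends DefaultHandler {
    private final SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
    private final Consumer<Element> onPage;
    private TransformerHandler builder; // non-null only while inside a <page>
    private DOMResult result;
    private int depth;                  // element nesting depth within the current page

    PageSplitter(Consumer<Element> onPage) {
        this.onPage = onPage;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (builder == null && "page".equals(localName)) {
            try {
                builder = factory.newTransformerHandler(); // identity: SAX in, DOM out
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            result = new DOMResult();
            builder.setResult(result);
            builder.startDocument();
            depth = 0;
        }
        if (builder != null) {
            builder.startElement(uri, localName, qName, atts);
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (builder != null) {
            builder.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (builder == null) {
            return; // outside any page: nothing is being built
        }
        builder.endElement(uri, localName, qName);
        if (--depth == 0) {
            builder.endDocument();
            Document doc = (Document) result.getNode();
            builder = null;   // drop references so the tree can be garbage-collected
            result = null;
            onPage.accept(doc.getDocumentElement());
        }
    }

    /** Parses the stream and calls onPage once per page element. */
    static void split(InputStream in, Consumer<Element> onPage) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // so localName is populated
        spf.newSAXParser().parse(in, new PageSplitter(onPage));
    }
}
```

Each `Element` passed to `onPage` is the root of a standalone tree for one page, so the usual DOM calls (`getElementsByTagName`, `getTextContent`) work on it, and memory use stays proportional to the largest single page as long as the callback does not keep references.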

Michael Kay answered Oct 29 '25 19:10


