Partial read of xml file

I need to read the first 15 lines from about 100 XML files that are up to 200,000 lines long. Is there a way to use something like BufferedReader to do this efficiently? The steps outlined in this question use DocumentBuilder.parse(String); this tries to parse the entire file at once.

EDIT: The first 15 elements contain metadata about the file (page names, last-edited dates, etc.) that I would like to parse into a table.

AnthonyW asked Dec 01 '22 19:12


2 Answers

I suggest looking into a streaming XML parser. Streaming APIs are meant for exactly this kind of use case, and they extend to files of several hundred gigabytes, which obviously cannot fit in memory.

In Java, StAX is a pull-based streaming API and a (fairly large) evolution of the older, push-based SAX API. The tutorial here covers parsing "on the fly":

http://tutorials.jenkov.com/java-xml/stax.html
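As a rough illustration of the pull-based approach, here is a minimal sketch that collects the names of the first few elements and then simply stops asking for events, so the rest of the file is never processed. The class name StaxFirstElements, the method firstElementNames and the sample XML are made up for this example; only the javax.xml.stream calls come from the JDK.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxFirstElements {

    // Hypothetical helper: returns the names of the first maxElements start tags.
    static List<String> firstElementNames(Reader xml, int maxElements) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<String> names = new ArrayList<>();
        try {
            // Pull events one at a time; leave the loop as soon as enough elements were seen.
            while (names.size() < maxElements && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            // Releases parser resources; the underlying Reader is not closed by this call.
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><first><inner>data</inner></first>"
                + "<second>second</second><third>third</third></root>";
        System.out.println(firstElementNames(new StringReader(xml), 3)); // prints [root, first, inner]
    }
}
```

For real files you would pass something like a FileReader or an InputStream instead of the StringReader, and replace the hard-coded 3 with 15 or with whatever condition marks the end of the metadata.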

Ishan Chatterjee answered Dec 04 '22 22:12


This is probably what you want: as I wrote in the comments, use a SAX parser and, once your stopping condition is met, abort parsing as described in this question:

How to stop parsing xml document with SAX at any time?

edit:

test.xml

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <first>
        <inner>data</inner>
    </first>
    <second>second</second>
    <third>third</third>
    <next>next</next>
</root>

ReadXmlUpToSomeElementSaxParser.java

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXmlUpToSomeElementSaxParser extends DefaultHandler {

    private final String lastElementToRead;

    public ReadXmlUpToSomeElementSaxParser(String lastElementToRead) {
        this.lastElementToRead = lastElementToRead;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // just for showing what is parsed
        System.out.println("startElement: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // stop parsing once the closing tag of the last element of interest is reached
        if (lastElementToRead.equals(qName)) {
            throw new MySaxTerminatorException();
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        try {
            saxParser.parse("src/test.xml", new ReadXmlUpToSomeElementSaxParser("second"));
        } catch (MySaxTerminatorException exp) {
            // nothing to do, expected
        }
    }

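    // marker exception used only to abort parsing; static, so it does not need an enclosing handler instance
    public static class MySaxTerminatorException extends SAXException {
    }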

}

output

startElement: root
startElement: first
startElement: inner
startElement: second

Why is that better? Simply because some application may send you the same document formatted like this:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <first><inner>data</inner></first>
    <second>second</second>
    <third>third</third>
    <next>next</next>
</root>

and a line-oriented approach will fail.

The parser above deliberately does not count elements, to show that the stopping condition can be defined by whatever business logic you need.
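If the condition really is "the first N elements", as in the question, a counting variant could look roughly like the sketch below. The names CountingSaxHandler, StopParsingException and firstElements are invented for this example, and it assumes maxElements is at least 1. Note that it stops in startElement, so the text of the last counted element is not read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CountingSaxHandler extends DefaultHandler {

    // Hypothetical marker exception, thrown only to abort parsing early.
    static class StopParsingException extends SAXException {
        private static final long serialVersionUID = 1L;
    }

    private final int maxElements; // assumed to be >= 1
    private final List<String> seen = new ArrayList<>();

    CountingSaxHandler(int maxElements) {
        this.maxElements = maxElements;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes)
            throws SAXException {
        seen.add(qName);
        // Abort as soon as the requested number of start tags has been recorded.
        if (seen.size() >= maxElements) {
            throw new StopParsingException();
        }
    }

    // Hypothetical helper: parses until maxElements start tags were seen or the document ends.
    static List<String> firstElements(InputSource source, int maxElements) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingSaxHandler handler = new CountingSaxHandler(maxElements);
        try {
            parser.parse(source, handler);
        } catch (StopParsingException expected) {
            // limit reached, nothing else to do
        }
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><first><inner>data</inner></first>"
                + "<second>second</second><third>third</third></root>";
        // prints [root, first, inner, second]
        System.out.println(firstElements(new InputSource(new StringReader(xml)), 4));
    }
}
```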

characters() warning

To read the text inside an element you can override the characters() method, but please be aware that

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks

read more in JavaDoc
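One common way to cope with this is to append every chunk to a buffer and only use the text when the element ends. The following is a minimal sketch under the assumption that only leaf elements carry text (no mixed content). TextCollectingHandler and leafText are names made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectingHandler extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append instead of assigning.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            values.put(qName, text); // only elements with non-blank text are recorded
        }
        buffer.setLength(0);
    }

    // Hypothetical helper: maps each leaf element name to its text content.
    static Map<String, String> leafText(InputSource source) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextCollectingHandler handler = new TextCollectingHandler();
        parser.parse(source, handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><first><inner>data</inner></first><second>second</second></root>";
        // prints {inner=data, second=second}
        System.out.println(leafText(new InputSource(new StringReader(xml))));
    }
}
```

The same buffering can be combined with an early-termination exception like the one above, so that only the metadata elements at the top of the file are read.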

Betlista answered Dec 04 '22 23:12