I need to read the first 15 lines from about 100 XML files, each up to 200,000 lines long. Is there a way to use something like BufferedReader to do this efficiently? The steps outlined in this question use DocumentBuilder.parse(String); this tries to parse the entire file at once.
EDIT: The first 15 elements contain metadata about the file (page names, last edited dates, etc) that I would like to parse into a table.
I suggest looking into a streaming XML parser. Streaming APIs read the document piece by piece instead of loading it whole, so they scale to files of hundreds of GB, which obviously cannot fit in memory.
In Java, the StAX API is a (fairly large) evolution of the native SAX APIs. Look through the tutorial here on parsing "on the fly":
http://tutorials.jenkov.com/java-xml/stax.html
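As a rough illustration of the StAX approach, here is a minimal sketch that pulls events from an XMLStreamReader and stops as soon as it has seen a given number of start tags, so the rest of the file is never read. The class name FirstElementsStaxReader, the method readFirstElementNames and the maxElements parameter are made up for this example; for the question above you would pass something like 15 and a FileReader.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
public class FirstElementsStaxReader {

    /** Returns the names of the first maxElements start tags, then stops reading. */
    public static List<String> readFirstElementNames(Reader input, int maxElements)
            throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
        try {
            // Pull events one at a time; leaving the loop ends the parse early.
            while (names.size() < maxElements && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close(); // releases parser resources; the Reader itself stays open
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><first><inner>data</inner></first>"
                + "<second>second</second><third>third</third><next>next</next></root>";
        // prints [root, first, inner, second]
        System.out.println(readFirstElementNames(new StringReader(xml), 4));
    }
}
```

Because StAX is a pull API, stopping is just a matter of not asking for the next event; no exception is needed.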
This is probably what you want to do. As I wrote in the comments, use a SAX parser and, once your stopping condition is met, abort the parse as described in:
How to stop parsing xml document with SAX at any time?
Edit: an example, using this sample file (src/test.xml):
<?xml version="1.0" encoding="UTF-8"?>
<root>
<first>
<inner>data</inner>
</first>
<second>second</second>
<third>third</third>
<next>next</next>
</root>
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXmlUpToSomeElementSaxParser extends DefaultHandler {

    // Parsing stops once this element has been closed.
    private final String lastElementToRead;

    public ReadXmlUpToSomeElementSaxParser(String lastElementToRead) {
        this.lastElementToRead = lastElementToRead;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // just for showing what is parsed
        System.out.println("startElement: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Throwing is the usual way to stop a SAX parse early.
        if (lastElementToRead.equals(qName)) {
            throw new MySaxTerminatorException();
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();
        try {
            saxParser.parse("src/test.xml", new ReadXmlUpToSomeElementSaxParser("second"));
        } catch (MySaxTerminatorException exp) {
            // nothing to do, expected
        }
    }

    // static: the exception needs no reference to the enclosing handler
    public static class MySaxTerminatorException extends SAXException {
    }
}

Output:
startElement: root
startElement: first
startElement: inner
startElement: second
Why is that better? Simply because some application can send you
<?xml version="1.0" encoding="UTF-8"?>
<root>
<first><inner>data</inner></first>
<second>second</second>
<third>third</third>
<next>next</next>
</root>
and a line-oriented approach will fail...
I deliberately provided a parser that does not count elements, to show that the stopping condition can be defined by whatever business logic you need.
To read the data inside an element you can use the characters() method, but please be aware that
SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks
Read more in the JavaDoc.
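Because of that chunking, the text of an element is usually collected in a buffer and only used in endElement. Below is a minimal sketch of that pattern; the class ElementTextCollector and its collect method are invented for this illustration, and it assumes simple elements that contain either text or child elements, not a mix of both.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of any library.
public class ElementTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // start a fresh buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append instead of assigning.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            values.put(qName, value); // element name -> complete text
        }
        text.setLength(0);
    }

    /** Parses the given XML string and returns the text of each non-empty element. */
    public static Map<String, String> collect(String xml) throws Exception {
        ElementTextCollector handler = new ElementTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><first><inner>data</inner></first><second>second</second></root>";
        // prints {inner=data, second=second}
        System.out.println(collect(xml));
    }
}
```

The same buffer can be combined with the terminating exception shown earlier, so that the metadata values are complete before the parse is aborted.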