I have a huge XML file (15 GB). I want to write the contents of each 'text' tag in the XML file out as a single page.
Sample XML file:
<root>
<page>
<id> 1 </id>
<text>
.... 1000 to 50000 lines of text
</text>
</page>
... likewise, around 2 million `page` tags
</root>
I initially used a DOM parser, but it throws JAVA OUT OF MEMORY (which is valid). Now I've written Java code using StAX. It works well, but the performance is really slow.
This is the code I've written:
XMLEventReader xMLEventReader = XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(filePath));
while (xMLEventReader.hasNext()) {
    xmlEvent = xMLEventReader.nextEvent();
    switch (xmlEvent.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            String element = ((StartElement) xmlEvent).getName().getLocalPart();
            if (element.equals("text"))
                isText = true;
            break;
        case XMLStreamConstants.CHARACTERS:
            chars = (Characters) xmlEvent;
            if (!(chars.isWhiteSpace() || chars.isIgnorableWhiteSpace()))
                if (isText)
                    pageContent += chars.getData() + '\n';
            break;
        case XMLStreamConstants.END_ELEMENT:
            String elementEnd = ((EndElement) xmlEvent).getName().getLocalPart();
            if (elementEnd.equals("text")) {
                createFile(id, pageContent);
                pageContent = "";
                isText = false;
            }
            break;
    }
}
This code works (please ignore any minor errors). According to my understanding, XMLStreamConstants.CHARACTERS fires for each and every line of the text tag: if the text tag has 10,000 lines in it, XMLStreamConstants.CHARACTERS is triggered 10,000 times. Is there a better way to improve the performance?
I can see a few things that might help you out:

- Use a BufferedInputStream rather than a simple FileInputStream to reduce the number of disk operations (see the sketch after this list).
- Use a StringBuilder to build your pageContent rather than String concatenation.
- Increase the JVM heap (the -Xmx option) in case you're memory bound while processing a file this large.

It can be quite interesting in cases like this to hook up a code profiler (e.g. Java VisualVM), as you are then able to see exactly which method calls are slow within your code. You can then focus optimisations appropriately.
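A minimal sketch of the first suggestion, assuming the same StAX setup as in the question (filePath is the asker's variable; the 64 KB buffer size is just a reasonable guess):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;

public class BufferedStaxSetup {
    // Wrap the FileInputStream in a BufferedInputStream so StAX pulls
    // large chunks from disk instead of making many tiny reads.
    static XMLEventReader openReader(String filePath) throws Exception {
        return XMLInputFactory.newInstance().createXMLEventReader(
                new BufferedInputStream(new FileInputStream(filePath), 1 << 16)); // 64 KB buffer
    }
}

The -Xmx setting is passed on the command line, e.g. java -Xmx4g YourMainClass (the 4g figure is only an example).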
If parsing of the XML file is the main issue, consider using VTD-XML, namely the extended version, as it supports files up to 256 GB.
As it is based on non-extractive document parsing, it is quite memory efficient, and using it to query/extract text via XPath is also very fast. You can read more details about this approach and VTD-XML here.
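A rough sketch of what that could look like with the standard VTD-XML API (class and method names are from my reading of the library and should be checked against its docs; the extended edition exposes analogous *Huge classes for very large files, and the pages.xml file name is made up):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdTextExtract {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("pages.xml", true)) {      // true = namespace aware
            throw new IllegalStateException("VTD parse failed");
        }
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/root/page/text");           // matches the sample layout above
        while (ap.evalXPath() != -1) {               // -1 means no more matches
            int t = vn.getText();                    // token index of the element's text
            if (t != -1) {
                String pageText = vn.toRawString(t); // extract the text content
                // write pageText out to its own file here
            }
        }
    }
}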
What is `pageContent`? It appears to be a `String`. One easy optimization to make right away would be to use a `StringBuilder` instead; it can append strings without having to make completely new copies of the strings like `String`'s `+=` does (you can also construct it with an initial reserved capacity to reduce memory reallocations and copies if you have an idea of the length to begin with).
Concatenating `String`s is a slow operation because strings are immutable in Java; each time you call `a += b` it must allocate a new string, copy `a` into it, then copy `b` onto the end of it, making each concatenation O(n) with respect to the total length of the two strings. The same goes for appending single characters. `StringBuilder`, on the other hand, has the same performance characteristics as an `ArrayList` when appending. So where you have:
pageContent += chars.getData() + '\n';
Instead, change `pageContent` to a `StringBuilder` and do:
pageContent.append(chars.getData()).append('\n');
Also, if you have a guess at the upper bound of the length of one of these strings, you can pass it to the `StringBuilder` constructor to allocate an initial amount of capacity and reduce the chance of a memory reallocation and full copy having to be done.
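For example (the 1 << 20 starting capacity is just an illustrative guess, not a measured figure):

// Reserve roughly 1 MB up front so early appends don't trigger reallocations.
StringBuilder pageContent = new StringBuilder(1 << 20);

// In the CHARACTERS case:
pageContent.append(chars.getData()).append('\n');

// At the </text> end element:
createFile(id, pageContent.toString());
pageContent.setLength(0); // reuse the same builder for the next page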
Another option, by the way, is to skip the `StringBuilder` altogether and write your data directly to your output file (presuming you're not processing the data somehow first). If you do this, and performance is I/O-bound, choosing an output file on a different physical disk can help.
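A sketch of that direct-write variant, assuming a hypothetical helper that opens one buffered writer per page (the file-naming scheme is made up):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class DirectPageWriter {
    // Hypothetical helper: one buffered writer per <page>, named after its id.
    static Writer openPageFile(String id) throws IOException {
        return new BufferedWriter(new FileWriter("page-" + id + ".txt"));
    }

    // Call this from the CHARACTERS case instead of appending to a builder.
    static void writeChunk(Writer out, String data) throws IOException {
        out.write(data);
        out.write('\n');
    }
}

Remember to close each writer when the corresponding </text> end element is reached.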
Try parsing with a SAX parser, because DOM will try to parse the entire content and place it in memory; that is why you are getting the memory exception. A SAX parser will not parse the entire content in one stretch.
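A minimal SAX sketch of the same extraction, assuming the sample layout above (the handler class and the pages.xml file name are illustrative, not the asker's code):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TextTagHandler extends DefaultHandler {
    private boolean isText = false;
    private final StringBuilder pageContent = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("text")) isText = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (isText) pageContent.append(ch, start, length); // no per-line String copies
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("text")) {
            // write pageContent out to its own file here
            pageContent.setLength(0);
            isText = false;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("pages.xml"), new TextTagHandler());
    }
}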