 

Huge XML file to text files

Tags:

java

xml

I have a huge XML file (15 GB). I want to write the contents of each 'text' tag in the XML file to its own file.

Sample XML file:

<root>
    <page>
        <id> 1 </id>
        <text>
        .... 1000 to 50000 lines of text
        </text>
    </page>
    ... Like wise 2 Million `page` tags
</root>

I initially used a DOM parser, but it throws java.lang.OutOfMemoryError (understandably). Now I've written Java code using StAX. It works well, but performance is really slow.

This is the code I've written:

XMLEventReader xmlEventReader = XMLInputFactory.newInstance()
        .createXMLEventReader(new FileInputStream(filePath));
while (xmlEventReader.hasNext()) {
    XMLEvent xmlEvent = xmlEventReader.nextEvent();

    switch (xmlEvent.getEventType()) {
    case XMLStreamConstants.START_ELEMENT:
        String element = ((StartElement) xmlEvent).getName().getLocalPart();
        if (element.equals("text"))      // compare with equals(), not ==
            isText = true;
        break;
    case XMLStreamConstants.CHARACTERS:
        Characters chars = (Characters) xmlEvent;
        if (!(chars.isWhiteSpace() || chars.isIgnorableWhiteSpace()))
            if (isText)
                pageContent += chars.getData() + '\n';
        break;
    case XMLStreamConstants.END_ELEMENT:
        String elementEnd = ((EndElement) xmlEvent).getName().getLocalPart();
        if (elementEnd.equals("text")) {
            createFile(id, pageContent);
            pageContent = "";
            isText = false;
        }
        break;
    }
}

This code works (ignore any minor errors). As I understand it, a XMLStreamConstants.CHARACTERS event fires for each chunk of the text tag: if a TEXT tag has 10,000 lines in it, I get roughly 10,000 CHARACTERS events. Is there any better way to improve the performance?

user1919035 asked Mar 07 '14

4 Answers

I can see a few things that might help you out:

  1. Use a BufferedInputStream rather than a plain FileInputStream to reduce the number of disk operations.
  2. Consider using a StringBuilder to build your pageContent rather than String concatenation.
  3. Increase your Java heap (-Xmx option) in case you're memory bound.

It can be quite interesting in cases like this to hook up a code profiler (e.g. Java VisualVM) as you are then able to see exactly what method calls are being slow within your code. You can then focus optimisations appropriately.
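Point 1 is a one-line change to the question's code: wrap the FileInputStream in a BufferedInputStream before handing it to the StAX factory. A minimal self-contained sketch (using an in-memory stream in place of the real 15 GB file, which is an assumption for demonstration only):

```java
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferedStaxDemo {
    // Counts StAX events; the buffering wrapper is the only change from the question's code.
    public static int countEvents(InputStream raw) throws Exception {
        BufferedInputStream in = new BufferedInputStream(raw); // fewer, larger reads from disk
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        int events = 0;
        while (reader.hasNext()) {
            reader.nextEvent();
            events++;
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        // In real use this would be: new FileInputStream(filePath)
        byte[] xml = "<root><page><text>hello</text></page></root>".getBytes(StandardCharsets.UTF_8);
        System.out.println(countEvents(new ByteArrayInputStream(xml)) > 0); // prints true
    }
}
```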

Richard Miskin answered Nov 15 '22


If parsing the XML file is the main issue, consider using VTD-XML, specifically the extended version, as it supports files up to 256 GB.

As it is based on non-extractive document parsing, it is quite memory-efficient, and using it to query/extract text via XPath is also very fast. You can read more details about this approach and VTD-XML here.

xlm answered Nov 15 '22


What is pageContent? It appears to be a String. One easy optimization to make right away would be to use a StringBuilder instead; it can append strings without making complete new copies the way += does (you can also construct it with an initial capacity to reduce memory reallocations and copies if you have an idea of the final length to begin with).

Concatenating Strings is a slow operation because strings are immutable in Java; each time you call a += b it must allocate a new string, copy a into it, then copy b onto the end of it, making each concatenation O(n) in the total length of the two strings. The same goes for appending single characters. StringBuilder, on the other hand, has the same amortized performance characteristics as an ArrayList when appending. So where you have:

pageContent += chars.getData() + '\n';

Instead change pageContent to a StringBuilder and do:

pageContent.append(chars.getData()).append('\n');

Also if you have a guess on the upper bound of the length of one of these strings, you can pass it to the StringBuilder constructor to allocate an initial amount of capacity and reduce the chance of a memory reallocation and full copy having to be done.

Another option, by the way, is to skip the StringBuilder altogether and write your data directly to your output file (presuming you're not processing the data somehow first). If you do this, and performance is I/O-bound, choosing an output file on a different physical disk can help.
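As a minimal sketch of the StringBuilder change described above (the buildPage helper and its pre-sized capacity are illustrative, not from the original code):

```java
public class PageBufferDemo {
    // Builds page content with a StringBuilder instead of repeated String +=.
    public static String buildPage(String[] lines) {
        // Pre-sizing is a rough guess at the final length; it avoids some internal reallocations.
        StringBuilder pageContent = new StringBuilder(lines.length * 64);
        for (String line : lines) {
            pageContent.append(line).append('\n'); // amortized O(1) per append
        }
        return pageContent.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildPage(new String[] {"line one", "line two"}));
        // prints:
        // line one
        // line two
    }
}
```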

Jason C answered Nov 15 '22


Try parsing with a SAX parser, because DOM tries to parse the entire document and place it in memory; that is why you are getting the memory exception. A SAX parser streams the content instead of parsing it all at one stretch.
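A minimal SAX sketch in the spirit of this answer (the class and method names are illustrative; it collects each text element's contents into a list rather than writing files):

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextExtractor extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    private boolean inText = false;
    public final List<String> pages = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("text".equals(qName)) { inText = true; current.setLength(0); }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (inText) current.append(ch, start, len); // SAX streams character chunks
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("text".equals(qName)) { pages.add(current.toString()); inText = false; }
    }

    public static List<String> extract(byte[] xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SaxTextExtractor handler = new SaxTextExtractor();
        parser.parse(new ByteArrayInputStream(xml), handler);
        return handler.pages;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<root><page><id>1</id><text>hello</text></page></root>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extract(xml)); // prints [hello]
    }
}
```

Note that StAX (as used in the question) is also a streaming API, so SAX mainly helps here as an alternative to the DOM approach that ran out of memory.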

Shriram answered Nov 15 '22