 

JDOM is using too much memory

I have an application that uses XMLUnit to get the differences between two XML files. But the problem is that XMLUnit uses JDOM, and my XML files are ~1 GB each!

It takes too much RAM to store those XML files in a JDOM document.

I have tried using SlimJDOMFactory, but it still uses too much RAM.

I actually need to navigate forward and backward in the XML files, and without JDOM I found no simple way to do that.

Can anyone help?

Here is a sample of the code that builds my JDOM documents:

    private org.jdom2.Document refDocJdom2;
    private org.jdom2.Document resDocJdom2;

    SAXBuilder sxb = new SAXBuilder();
    sxb.setJDOMFactory(new SlimJDOMFactory());

    popmsg("Validating reference file...");
    try {
        refDocJdom2 = sxb.build(referenceXML_Path);
    } catch (Exception e) {
        JOptionPane.showMessageDialog(null, "Error while parsing reference: " + referenceXML_Path + " file.\nCheck XML file validity.");
        return;
    }
    popmsg("Reference file validated");

    popmsg("Validating result file...");
    try {
        resDocJdom2 = sxb.build(resultXML_Path);
    } catch (Exception e) {
        JOptionPane.showMessageDialog(null, "Error while parsing result: " + resultXML_Path + " file.\nCheck XML file validity.");
        return;
    }
    popmsg("Result file validated");
    popmsg("Validation done.");

    getDifferences(referenceXML_Path, resultXML_Path);
    d2 = new Date();
    }

    public void getDifferences(String fileRef, String fileRes) throws SAXException, IOException {
        popmsg("Documents: VALID XML format");
        popmsg("Searching for differences...");

        Reader refReader = new FileReader(fileRef);
        Reader resReader = new FileReader(fileRes);
        Diff aDifference = new Diff(refReader, resReader);

        refReader.close();
        refReader = null;
        resReader.close();
        resReader = null;

        // TODO
        // XMLUnit.setIgnoreWhitespace(true);

        myDetailledDiff = new DetailedDiff(aDifference);
        myDetailledDiff.overrideDifferenceListener(new IgnoreNamedElementsDifferenceListener());
        myDetailledDiff.overrideElementQualifier(new ElementNameAndAttributeQualifier());
        allDiffs = myDetailledDiff.getAllDifferences();
        myDetailledDiff = null;

        popmsg("Got all differences...\nGoing to sort them now...");
        popmsg("Diff SIZE: " + allDiffs.size());

        myDiffsList = new ArrayList<MyDifference>(allDiffs.size());
        if (allDiffs.size() > 0) {
            for (int i = 0; i < allDiffs.size(); i++) {
                Difference aDiff = (Difference) allDiffs.get(i);

                myDiffsList.add(new MyDifference(aDiff, refDocJdom2, resDocJdom2));

                if (myDiffsList.size() == LIMIT)
                    return;
                if (i % 25 == 0 && i != 0) {
                    popmsg("**************************************************\t" + i + "\n");
                }
            }

            allDiffs.clear();
            allDiffs = null;

        } else {
            popmsg("NO DIFFERENCES");
        }
    }
JajaDrinker asked May 23 '14

1 Answer

JDOM reads the entire XML document into memory. This is 'normal' for any memory-based model of XML (XOM/DOM/JDOM/etc.), and it is also the well-known weakness of these systems. Ultimately, there is no solution to this problem while still keeping an in-memory representation of the whole XML.

When reading an XML document (typically UTF-8), 1 GB of data on disk translates roughly proportionately into that many characters in memory, and since a Java char takes 2 bytes, that is about 2 GB for the character data alone. That is what you should 'budget' for a 1 GB XML document.
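As a rough sanity check of that arithmetic (a sketch, not JDOM-specific): each byte of mostly-ASCII UTF-8 on disk decodes to one Java char, and the char API is 2 bytes per char, so the character data roughly doubles in memory:

```java
import java.nio.charset.StandardCharsets;

// Rough sketch of the disk-to-memory budget: for mostly-ASCII UTF-8,
// each byte on disk decodes to one Java char (2 bytes in the char API),
// so 1 GB of XML on disk is ~2 GB of character data in memory.
public class XmlMemoryBudget {
    public static void main(String[] args) {
        String sample = "<tag attr=\"value\">text</tag>"; // stand-in for file content
        byte[] onDisk = sample.getBytes(StandardCharsets.UTF_8);
        long inMemoryCharBytes = (long) sample.length() * Character.BYTES;
        System.out.println("disk bytes: " + onDisk.length);        // prints disk bytes: 28
        System.out.println("char bytes: " + inMemoryCharBytes);    // prints char bytes: 56
    }
}
```

This ignores the object overhead of the document model itself (Element instances, lists, references), which comes on top of the raw character data.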

The SlimJDOMFactory reuses Strings inside the XML instead of keeping references to new ones; essentially, it de-duplicates string values. This is very convenient when you have many elements, attributes, and other structures with the same names. For example, without the SlimJDOMFactory, an XML document with 1M <tag /> elements will have 1M different Element instances, each with its own copy of the name tag. Assuming tag is about a 32-byte object, about 32 MB is needed to store those strings; the SlimJDOMFactory reduces that to a single 32-byte instance. But that only goes 'so far', and it does not solve the fact that as the document grows, it will take more space... it just delays when you run out of memory. It has some other consequences, both good and bad: on the good side, it reduces garbage-collection time because there is less memory to scan; on the bad side, it slows (slightly) the document load time as it de-duplicates. My testing indicates that for documents that live in memory for even a few GC cycles, the net benefit of the smaller in-memory footprint is quickly realized, and the performance cost on the parse side is 'paid back'.
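The idea can be sketched with a tiny cache (this is the concept, not SlimJDOMFactory's actual code): every string coming from the parser is looked up, and a previously seen identical string is returned instead of the fresh copy, so 1M repeated names collapse to one shared instance.

```java
import java.util.HashMap;
import java.util.Map;

// Concept sketch of string de-duplication (not SlimJDOMFactory's real code):
// repeated names like "tag" all collapse to one shared String instance.
public class StringCache {
    private final Map<String, String> cache = new HashMap<>();

    public String intern(String s) {
        String prev = cache.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    public static void main(String[] args) {
        StringCache names = new StringCache();
        String a = names.intern(new String("tag")); // fresh instance from the "parser"
        String b = names.intern(new String("tag")); // another fresh instance
        System.out.println(a == b); // prints true: one shared instance survives
    }
}
```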

Typical solutions for this problem are:

  1. Use SAX directly, and not have an in-memory model at all.
  2. Split the input files into smaller chunks. This is the normal solution, and it makes a lot of sense for many reasons (it reduces latencies, you can parse the files in parallel, etc.).
  3. Logically split the XML into sections that are still valid XML, and parse portions of the file using special InputStreams over file subsets.
  4. Add more memory to your system.
  5. Use a custom JDOMFactory that skips content you know you will never need (the JDOMFactory is invoked as part of the document SAXBuild process, so you can actually 'trim' the file contents to just the subset you know you will need, and still end up with a JDOM document that is in memory and navigable, at least what's left of it).
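Option 1 can look like this with the JDK's built-in SAX parser (the `record` element name here is hypothetical): the document streams past a handler, and nothing stays in memory beyond whatever you accumulate yourself, so memory use stays flat regardless of file size.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of option 1: stream the document with SAX and keep only what you
// need (here, a count), never building a full in-memory tree.
public class SaxStreamCount {
    public static int countRecords(String xml) throws Exception {
        final int[] records = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("record".equals(qName)) { // hypothetical element name
                    records[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return records[0];
    }

    public static void main(String[] args) throws Exception {
        // In real use, pass an InputSource over the 1 GB file instead.
        System.out.println(countRecords("<root><record/><record/><record/></root>")); // prints 3
    }
}
```

The trade-off is exactly the one the question runs into: SAX is forward-only, so navigating backward means either re-parsing or accumulating the parts you need as you go.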

None of these solutions are 'great', but that's what you get with an in-memory XML system.

rolfl answered Nov 08 '22