I have an application that uses XMLUnit to find the differences between two XML files. The problem is that XMLUnit uses JDOM, and my XML files are ~1GB in size!
Storing those files in JDOM documents takes too much RAM.
I have tried using SlimJDOMFactory, but it still uses too much RAM.
I need to navigate both forward and backward in the XML files, and I have found no simple way to do that without JDOM.
Can anyone help?
Here is a sample of the code I use to build my JDOM documents:
private org.jdom2.Document refDocJdom2;
private org.jdom2.Document resDocJdom2;

    SAXBuilder sxb = new SAXBuilder();
    sxb.setJDOMFactory(new SlimJDOMFactory());
    popmsg("Validating reference file...");
    try {
        refDocJdom2 = sxb.build(referenceXML_Path);
    } catch (Exception e) {
        JOptionPane.showMessageDialog(null, "Error while parsing Reference : " + referenceXML_Path + " file.\nCheck XML file validity.");
        return;
    }
    popmsg("Reference file validated");
    popmsg("Validating result file....");
    try {
        resDocJdom2 = sxb.build(resultXML_Path);
    } catch (Exception e) {
        JOptionPane.showMessageDialog(null, "Error while parsing result " + resultXML_Path + " file.\nCheck XML file validity.");
        return;
    }
    popmsg("Result file validated");
    popmsg("Validation Done.");
    getDifferencies(referenceXML_Path, resultXML_Path);
    d2 = new Date();
}
public void getDifferencies(String fileRef, String fileRes) throws SAXException, IOException {
    popmsg("Documents : VALID XML format");
    popmsg("Searching for differences...");
    Diff aDifference;
    // Close both readers as soon as the Diff has been constructed.
    try (Reader refReader = new FileReader(fileRef);
         Reader resReader = new FileReader(fileRes)) {
        aDifference = new Diff(refReader, resReader);
    }
    //TODO
    // XMLUnit.setIgnoreWhitespace(true);
    myDetailledDiff = new DetailedDiff(aDifference);
    myDetailledDiff.overrideDifferenceListener(new IgnoreNamedElementsDifferenceListener());
    myDetailledDiff.overrideElementQualifier(new ElementNameAndAttributeQualifier());
    allDiffs = myDetailledDiff.getAllDifferences();
    myDetailledDiff = null;
    popmsg("Got all differences...\nGoing to sort them now...");
    popmsg("Diff SIZE : " + allDiffs.size());
    myDiffsList = new ArrayList<MyDifference>(allDiffs.size());
    if (allDiffs.size() > 0) {
        Difference aDiff;
        for (int i = 0; i < allDiffs.size(); i++) {
            aDiff = (Difference) allDiffs.get(i);
            // Each MyDifference is built against the two in-memory JDOM documents.
            myDiffsList.add(new MyDifference(aDiff, refDocJdom2, resDocJdom2));
            if (myDiffsList.size() == LIMIT)
                return;
            // Progress message every 25 differences.
            if (i % 25 == 0 && i != 0) {
                popmsg("**************************************************\t" + i + "\n");
            }
        }
        allDiffs.clear();
        allDiffs = null;
    } else {
        popmsg("NO DIFFERENCES");
    }
}
JDOM reads the entire XML document into memory. This is 'normal' for any memory-based model of XML (XOM/DOM/JDOM/etc.), and it is also the well-known weakness of these systems. Ultimately, there is no solution to this problem as long as you keep an in-memory representation of the whole XML.
When reading an XML document (typically UTF-8), 1GB of data on disk typically translates to roughly that many characters in memory. Since a Java char occupies 2 bytes, that is about 2GB. That is what you should 'budget' for a 1GB XML document.
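The 2GB figure can be checked with a little arithmetic. The sketch below is only an illustration: the class name XmlMemoryBudget is made up, and it assumes mostly-ASCII content (one UTF-8 byte per character) stored as 2-byte Java chars. It ignores the object overhead of the tree itself, and JVMs that store Latin-1 strings compactly may need less for the character data.

```java
// Back-of-the-envelope heap estimate for the character data of an XML file.
// Hypothetical helper for illustration; not part of JDOM or XMLUnit.
final class XmlMemoryBudget {

    /** Bytes needed to hold charCount characters at 2 bytes per Java char. */
    static long charBytes(long charCount) {
        return charCount * Character.BYTES;
    }

    public static void main(String[] args) {
        long fileBytes = 1L << 30;   // 1 GiB on disk
        long chars = fileBytes;      // assumption: mostly ASCII, so 1 byte per character in UTF-8
        long gib = charBytes(chars) / (1L << 30);
        System.out.println(gib + " GiB for character data alone");
    }
}
```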
The SlimJDOMFactory reuses Strings inside the XML instead of keeping references to new ones; essentially, it de-duplicates string values. This is very convenient when you have many elements, tags, and other structures with the same names. For example, without the SlimJDOMFactory, an XML document with 1M <tag /> elements will have 1M different Element instances, each with its own copy of the name "tag". Assuming each "tag" String is about a 32-byte object, about 32MB will be needed to store those strings. The SlimJDOMFactory reduces that to just 32 bytes, but that only goes 'so far': it does not change the fact that as the document grows, it takes more space. It just 'delays' the point at which you run out of memory.
It has some other consequences, both good and bad. Good: it reduces garbage-collection time because there is less memory to scan. Bad: it (slightly) slows the document load time as it de-duplicates. My testing indicates that for documents that live in memory for even a few GC cycles, the net benefit of the smaller in-memory footprint is quickly realized, and the performance cost on the parse side is 'paid back'.
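To make the de-duplication idea concrete, here is a minimal sketch of a string cache in plain Java. It is not the SlimJDOMFactory implementation; the class name StringDeduper and its methods are invented for illustration. It only shows the general technique of handing back one shared instance for equal strings.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative string de-duplication cache (hypothetical; not JDOM code).
final class StringDeduper {
    private final Map<String, String> seen = new HashMap<>();

    /** Returns the first instance seen that equals s, so equal strings share one object. */
    String dedupe(String s) {
        String prior = seen.putIfAbsent(s, s);
        return prior != null ? prior : s;
    }

    /** Number of distinct string values retained by the cache. */
    int distinctCount() {
        return seen.size();
    }

    public static void main(String[] args) {
        StringDeduper cache = new StringDeduper();
        String first = cache.dedupe(new String("tag"));
        boolean allShared = true;
        for (int i = 0; i < 1_000_000; i++) {
            // new String(...) forces a distinct object, much as a parser might produce one per element.
            allShared &= cache.dedupe(new String("tag")) == first;
        }
        System.out.println("shared=" + allShared + ", distinct=" + cache.distinctCount());
    }
}
```

The lookup on every string is the small parse-time cost mentioned above. The saving is that the one million temporary copies become garbage immediately instead of staying reachable from the document.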
Typical solutions for this problem are:
None of these solutions are 'great', but that's what you get with an in-memory XML system.
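For contrast with the in-memory model, here is a small sketch of a streaming read using the JDK's StAX API (javax.xml.stream). This is an illustration and not necessarily one of the options the answer had in mind; the class name StreamingElementCounter is made up. It counts elements by name while holding only the current parse event in memory. Note that StAX is forward-only, so on its own it does not provide the backward navigation the question asks for.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: count elements by local name without building a tree.
final class StreamingElementCounter {

    static Map<String, Integer> countElements(Reader in) {
        Map<String, Integer> counts = new HashMap<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                while (xml.hasNext()) {
                    // Only the current event is held; earlier events are not retained.
                    if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                        counts.merge(xml.getLocalName(), 1, Integer::sum);
                    }
                }
            } finally {
                xml.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "<root><tag/><tag/><other/></root>";
        System.out.println(countElements(new StringReader(sample)));
    }
}
```

For a real 1GB file, the StringReader in main would be replaced by a buffered file reader; memory use then depends on what the loop chooses to keep, not on the file size.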