I am trying to merge many XML files into one. I have done this successfully with DOM, but that solution only works for a small number of files. When I run it on more than 1000 files, I get a java.lang.OutOfMemoryError.
What I want to achieve is the following. Given these files:
file 1:
<root>
....
</root>
file 2:
<root>
......
</root>
file n:
<root>
....
</root>
the resulting output should be:
<rootSet>
<root>
....
</root>
<root>
....
</root>
<root>
....
</root>
</rootSet>
This is my current implementation:
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element rootSetElement = doc.createElement("rootSet");
Node rootSetNode = doc.appendChild(rootSetElement);
Element creationElement = doc.createElement("creationDate");
rootSetNode.appendChild(creationElement);
creationElement.setTextContent(dateString);
File dir = new File("/tmp/rootFiles");
String[] files = dir.list();
if (files == null) {
    System.out.println("No roots to merge!");
} else {
    Document rootDocument;
    for (int i = 0; i < files.length; i++) {
        File filename = new File(dir, files[i]);
        rootDocument = docBuilder.parse(filename);
        // Copy the <root> element of the parsed file into the merged document
        Node tempDoc = doc.importNode(rootDocument.getElementsByTagName("root").item(0), true);
        rootSetNode.appendChild(tempDoc);
    }
}
I have experimented a lot with XSLT and SAX, but I seem to keep missing something. Any help would be highly appreciated.
You might also consider using StAX. Here's code that would do what you want:
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;

public class XMLConcat {
    public static void main(String[] args) throws Throwable {
        File dir = new File("/tmp/rootFiles");
        File[] rootFiles = dir.listFiles();
        // Use an explicit encoding so the bytes written do not depend on the platform default
        OutputStream outputStream = new FileOutputStream("/tmp/mergedFile.xml");
        XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
        XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputStream, "UTF-8");
        XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();
        xmlEventWriter.add(xmlEventFactory.createStartDocument("UTF-8"));
        xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));
        XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
        for (File rootFile : rootFiles) {
            XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
            XMLEvent event = xmlEventReader.nextEvent();
            // Skip ahead in the input to the opening document element
            while (event.getEventType() != XMLEvent.START_ELEMENT) {
                event = xmlEventReader.nextEvent();
            }
            // Copy every event up to (but not including) the end of the input document
            do {
                xmlEventWriter.add(event);
                event = xmlEventReader.nextEvent();
            } while (event.getEventType() != XMLEvent.END_DOCUMENT);
            xmlEventReader.close();
        }
        xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
        xmlEventWriter.add(xmlEventFactory.createEndDocument());
        xmlEventWriter.close();
        outputStream.close();
    }
}
One minor caveat is that this API seems to mess with empty tags, changing <foo/> into <foo></foo>.
Just do it without any XML parsing, since the task doesn't seem to require actually parsing the XML.
For efficiency, do something like this:
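import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Note: assumes the inputs have no <?xml ...?> declaration, which would otherwise end up inside <rootSet>
File dir = new File("/tmp/rootFiles");
String[] files = dir.list();
if (files == null) {
    System.out.println("No roots to merge!");
} else {
    try (FileChannel output = new FileOutputStream("output").getChannel()) {
        // Opening wrapper tag; the encoding should match that of the input files
        output.write(ByteBuffer.wrap("<rootSet>\n".getBytes(StandardCharsets.UTF_8)));
        for (String file : files) {
            try (FileChannel in = new FileInputStream(new File(dir, file)).getChannel()) {
                // transferTo may move fewer bytes than requested, so loop until the whole file is copied
                long size = in.size();
                for (long pos = 0; pos < size; ) {
                    pos += in.transferTo(pos, size - pos, output);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        output.write(ByteBuffer.wrap("</rootSet>\n".getBytes(StandardCharsets.UTF_8)));
    } catch (IOException e) {
        e.printStackTrace();
    }
}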
DOM needs to keep the whole document in memory. If you don't need to do anything special with the tags, I would simply read each file through an InputStream and copy it to the output. If you do need to process the content, use SAX.
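As a rough illustration of the plain-stream idea, here is a minimal sketch. The class name StreamConcat and the merge helper are made up for this example, and it assumes the inputs are UTF-8 files ending in .xml that carry no XML declaration or DOCTYPE (those would otherwise land in the middle of the merged document). The output file should live outside the input directory so it is not picked up as an input.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamConcat {
    // Hypothetical helper: wraps the content of every *.xml file in inputDir
    // in a single <rootSet> element written to outputFile.
    // Assumption: inputs are UTF-8 and contain no XML declaration or DOCTYPE.
    static void merge(Path inputDir, Path outputFile) throws IOException {
        try (OutputStream out = Files.newOutputStream(outputFile);
             DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.xml")) {
            out.write("<rootSet>\n".getBytes(StandardCharsets.UTF_8));
            for (Path file : files) {
                // Files.copy streams the bytes through a small buffer,
                // so memory use does not grow with the number of files
                Files.copy(file, out);
                out.write('\n');
            }
            out.write("</rootSet>\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

It could be called as StreamConcat.merge(Paths.get("/tmp/rootFiles"), Paths.get("/tmp/mergedFile.xml")). Note that DirectoryStream does not guarantee any particular iteration order, so sort the paths first if the order of the <root> elements matters.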