How to merge >1000 xml files into one using Java

I am trying to merge many XML files into one. I have done this successfully with DOM, but that solution only works for a small number of files. When I run it on more than 1000 files, I get a java.lang.OutOfMemoryError.

What I want to achieve is the following. I have these files:

file 1:

<root>
....
</root>

file 2:

<root>
......
</root>

file n:

<root>
....
</root>

and I want this output:

<rootSet>
<root>
....
</root>
<root>
....
</root>
<root>
....
</root>
</rootSet>

This is my current implementation:

    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document doc = docBuilder.newDocument();
    Element rootSetElement = doc.createElement("rootSet");
    Node rootSetNode = doc.appendChild(rootSetElement);
    Element creationElement = doc.createElement("creationDate");
    rootSetNode.appendChild(creationElement);
    creationElement.setTextContent(dateString); 
    File dir = new File("/tmp/rootFiles");
    String[] files = dir.list();
    if (files == null) {
        System.out.println("No roots to merge!");
    } else {
        Document rootDocument;
        for (int i = 0; i < files.length; i++) {
            File filename = new File(dir + "/" + files[i]);
            rootDocument = docBuilder.parse(filename);
            Node tempDoc = doc.importNode(rootDocument.getElementsByTagName("root").item(0), true);
            rootSetNode.appendChild(tempDoc);
        }
    }

I have experimented a lot with XSLT and SAX, but I seem to keep missing something. Any help would be highly appreciated.

Andra asked May 25 '12 18:05



3 Answers

You might also consider using StAX. Here's code that would do what you want:

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;

public class XMLConcat {
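    public static void main(String[] args) throws Exception {
        File dir = new File("/tmp/rootFiles");
        File[] rootFiles = dir.listFiles();
        if (rootFiles == null) {
            // listFiles() returns null if the directory is missing or unreadable
            System.out.println("No roots to merge!");
            return;
        }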

        Writer outputWriter = new FileWriter("/tmp/mergedFile.xml");
        XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
        XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputWriter);
        XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();

        xmlEventWriter.add(xmlEventFactory.createStartDocument());
        xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));

        XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
        for (File rootFile : rootFiles) {
            XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
            XMLEvent event = xmlEventReader.nextEvent();
            // Skip ahead in the input to the opening document element
            while (event.getEventType() != XMLEvent.START_ELEMENT) {
                event = xmlEventReader.nextEvent();
            }

            do {
                xmlEventWriter.add(event);
                event = xmlEventReader.nextEvent();
            } while (event.getEventType() != XMLEvent.END_DOCUMENT);
            xmlEventReader.close();
        }

        xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
        xmlEventWriter.add(xmlEventFactory.createEndDocument());

        xmlEventWriter.close();
        outputWriter.close();
    }
}

One minor caveat is that this API seems to rewrite empty-element tags, changing <foo/> into <foo></foo>. The two forms are equivalent to an XML parser.
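
The question's DOM code also writes a creationDate element, which the code above leaves out. A minimal sketch of how it could be emitted with the same event API is shown here. The class and method names (CreationDateEvents, writeCreationDate, demo) are made up for illustration, and dateString is assumed to be an already formatted string, as in the question.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

// Hypothetical helper, not part of the answer's original code.
class CreationDateEvents {

    /** Adds a creationDate element containing dateString to an already open event writer. */
    static void writeCreationDate(XMLEventWriter writer, XMLEventFactory events, String dateString)
            throws XMLStreamException {
        writer.add(events.createStartElement("", "", "creationDate"));
        writer.add(events.createCharacters(dateString));
        writer.add(events.createEndElement("", "", "creationDate"));
    }

    /** Builds a rootSet element that holds only the creationDate, in memory, to show the call order. */
    static String demo(String dateString) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newFactory();

        writer.add(events.createStartElement("", "", "rootSet"));
        writeCreationDate(writer, events, dateString);
        writer.add(events.createEndElement("", "", "rootSet"));

        writer.flush();
        writer.close();
        return out.toString();
    }
}
```

In the merge program, writeCreationDate would be called once, right after the rootSet start element is added and before the loop over the input files.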

csd answered Oct 23 '22 03:10



Just do it without any XML parsing, since the task doesn't seem to require actually parsing the XML.

For efficiency do something like this:

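import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

File dir = new File("/tmp/rootFiles");
String[] files = dir.list();
if (files == null) {
    System.out.println("No roots to merge!");
} else {
    try (FileChannel output = new FileOutputStream("output").getChannel()) {
        ByteBuffer buff = ByteBuffer.allocate(32);
        // UTF-8 is assumed here; use the encoding of the input files
        buff.put("<rootSet>\n".getBytes(StandardCharsets.UTF_8));
        buff.flip();
        output.write(buff);
        buff.clear();
        for (String file : files) {
            try (FileChannel in = new FileInputStream(new File(dir, file)).getChannel()) {
                // transferTo may copy fewer bytes than requested, so loop until the whole file is written
                long position = 0;
                long size = in.size();
                while (position < size) {
                    position += in.transferTo(position, size - position, output);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        buff.put("</rootSet>\n".getBytes(StandardCharsets.UTF_8));
        buff.flip();
        output.write(buff);
    } catch (IOException e) {
        e.printStackTrace();
    }
}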
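
Raw concatenation only produces well-formed XML if the input files carry no XML declaration (<?xml ...?>), because a declaration is only allowed at the very start of a document. The following is a rough sketch of one way to drop it while still streaming. It assumes the inputs are UTF-8 (or another ASCII-compatible encoding) without a byte order mark, that any declaration starts at the first byte and ends within the first 1024 bytes, and that there is no DOCTYPE. RawConcat and its method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper; the assumptions are listed in the text above.
class RawConcat {
    private static final int HEAD_SIZE = 1024;

    /** Copies one file to out, skipping a leading XML declaration if one is found. */
    static void appendWithoutDeclaration(File input, OutputStream out) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(input))) {
            // Read up to HEAD_SIZE bytes so the start of the file can be inspected
            byte[] head = new byte[HEAD_SIZE];
            int n = 0;
            int r;
            while (n < HEAD_SIZE && (r = in.read(head, n, HEAD_SIZE - n)) > 0) {
                n += r;
            }
            // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets
            String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
            int skip = 0;
            if (prefix.startsWith("<?xml ")) {
                int end = prefix.indexOf("?>");
                if (end >= 0) {
                    skip = end + 2;
                }
            }
            out.write(head, skip, n - skip);

            // Copy the remainder of the file unchanged
            byte[] buf = new byte[8192];
            while ((r = in.read(buf)) != -1) {
                out.write(buf, 0, r);
            }
        }
    }

    /** Wraps all inputs in a rootSet element without parsing them. */
    static void concat(List<File> inputs, OutputStream out) throws IOException {
        out.write("<rootSet>\n".getBytes(StandardCharsets.UTF_8));
        for (File input : inputs) {
            appendWithoutDeclaration(input, out);
            out.write('\n');
        }
        out.write("</rootSet>\n".getBytes(StandardCharsets.UTF_8));
    }
}
```

This uses plain streams rather than channels to keep the header check simple. Only a fixed-size buffer is held in memory at any time, regardless of how many files are merged.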
Mattias Isegran Bergander answered Oct 23 '22 04:10



DOM needs to keep the whole document in memory. If you don't need to do anything special with the tags, I would simply read each file with an InputStream and copy it to the output. If you do need to process the content, use SAX.
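
For the SAX route, one possible sketch (not a definitive implementation) is to forward each file's parse events to a single serializer, so that no input document is ever held in memory as a tree. The names SaxMerge, BodyOnlyFilter and merge are invented for this example. It relies on the JDK's identity TransformerHandler as the serializer, which is assumed to write events out as they arrive. Comments in the input files are dropped, because no LexicalHandler is wired up.

```java
import java.io.File;
import java.io.OutputStream;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical class illustrating a SAX-based merge.
class SaxMerge {

    /** Forwards every SAX event except the per-file document start and end. */
    static class BodyOnlyFilter extends XMLFilterImpl {
        @Override
        public void startDocument() {
            // suppressed: the merged output has a single document start
        }

        @Override
        public void endDocument() {
            // suppressed: the merged output has a single document end
        }
    }

    /** Writes all inputs, in order, as children of one rootSet element. */
    static void merge(List<File> inputs, OutputStream out) throws Exception {
        // An identity TransformerHandler turns SAX events back into XML text
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.setResult(new StreamResult(out));

        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        BodyOnlyFilter filter = new BodyOnlyFilter();
        filter.setParent(pf.newSAXParser().getXMLReader());
        filter.setContentHandler(serializer);

        serializer.startDocument();
        serializer.startElement("", "rootSet", "rootSet", new AttributesImpl());
        for (File input : inputs) {
            // Each file's events pass through the filter straight to the serializer
            filter.parse(new InputSource(input.toURI().toASCIIString()));
        }
        serializer.endElement("", "rootSet", "rootSet");
        serializer.endDocument();
    }
}
```

A call such as SaxMerge.merge(Arrays.asList(dir.listFiles()), new FileOutputStream("/tmp/mergedFile.xml")) would cover the directory from the question. Any per-element processing can be added by overriding startElement or characters in the filter.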

Carlos Tasada answered Oct 23 '22 05:10
