
Splitting a huge XML file (>10 GB) into small chunks using the StAX parser

Tags: java, xml, stax

We have a scenario where we need to split a large XML file (more than 10 GB) into small chunks. Each chunk should contain 100 or 200 elements. Example XML:

<Employees>
  <Employee id="1">
    <age>29</age>
    <name>Pankaj</name>
    <gender>Male</gender>
    <role>Java Developer</role>
  </Employee>
  <Employee id="3">
    <age>35</age>
    <name>Lisa</name>
    <gender>Female</gender>
    <role>CEO</role>
  </Employee>
  <Employee id="3">
    <age>40</age>
    <name>Tom</name>
    <gender>Male</gender>
    <role>Manager</role>
  </Employee>
  <Employee id="3">
    <age>25</age>
    <name>Meghna</name>
    <gender>Female</gender>
    <role>Manager</role>
  </Employee>
  <Employee id="3">
    <age>29</age>
    <name>Pankaj</name>
    <gender>Male</gender>
    <role>Java Developer</role>
  </Employee>
  <Employee id="3">
    <age>35</age>
    <name>Lisa</name>
    <gender>Female</gender>
    <role>CEO</role>
  </Employee>
  <Employee id="3">
    <age>40</age>
    <name>Tom</name>
    <gender>Male</gender>
    <role>Manager</role>
  </Employee>
</Employees>

I have StAX parser code that splits the file into small chunks, but each output file contains only one complete Employee element, whereas I need 100, 200, or more <Employee> elements in a single file. Here is my Java code:

public static void main(String[] s) throws Exception {
    String prefix = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + "\n";
    String suffix = "\n</Employees>\n";
    int count = 0;
    try {
        int i = 0;
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("D:\\Desktop\\Test\\latestxml\\test.xml"));
        xsr.nextTag(); // Advance to the root <Employees> element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        // Each iteration copies exactly one <Employee> element into its own file
        while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            File file = new File("C:\\Users\\test\\Desktop\\xml\\" + "out" + i + ".xml");
            FileOutputStream fos = new FileOutputStream(file, true);
            t.transform(new StAXSource(xsr), new StreamResult(fos));
            i++;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Naveen asked Dec 08 '15 05:12


2 Answers

Do not increment i on every iteration. It should be updated only when the number of elements written to the current file reaches 100 or 200.

For example:

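String outputPath = "/test/path/foo.txt";

while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
    // Keep appending to the same file until 100 elements have been written
    FileOutputStream file = new FileOutputStream(outputPath, true);
    ...
    ...
    count++;
    if (count == 100) {
        // Chunk is full: switch to the next output file and reset the counter
        i++;
        outputPath = "/test/path/foo" + i + ".txt";
        count = 0;
    }
}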
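
The rollover idea above can be sketched end to end as follows. This is only a sketch under stated assumptions: it uses the StAX event API (XMLEventReader and XMLEventWriter) rather than the Transformer from the question, and the class name EmployeeChunkSplitter, the split method, and the outN.xml file naming are hypothetical. Each chunk is wrapped in a copy of the original root element so that every output file is well-formed on its own; whitespace and comments between records are not copied.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper; the class, method, and file names are assumptions.
final class EmployeeChunkSplitter {

    /** Writes out0.xml, out1.xml, ... into outDir and returns the number of files written. */
    static int split(Path input, Path outDir, int chunkSize)
            throws IOException, XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        int fileIndex = 0;   // advances only when a chunk is full
        int inChunk = 0;     // records written to the current chunk
        int depth = 0;       // 1 = inside the root, 2 or more = inside a record
        StartElement root = null;
        OutputStream os = null;
        XMLEventWriter writer = null;

        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent ev = reader.nextEvent();
                if (ev.isStartElement()) {
                    depth++;
                    if (depth == 1) {
                        root = ev.asStartElement();  // remembered so each chunk gets its own root
                    } else if (depth == 2 && writer == null) {
                        // First record of a new chunk: open the file and write the root start tag
                        os = Files.newOutputStream(outDir.resolve("out" + fileIndex + ".xml"));
                        writer = outFactory.createXMLEventWriter(os, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(root);
                    }
                }
                if (depth >= 2) {
                    writer.add(ev);                  // copy every event that belongs to a record
                }
                if (ev.isEndElement()) {
                    depth--;
                    if (depth == 1 && ++inChunk == chunkSize) {
                        closeChunk(events, writer, os, root.getName());
                        writer = null;
                        inChunk = 0;
                        fileIndex++;
                    }
                }
            }
            reader.close();
        }
        if (writer != null) {                        // last, partially filled chunk
            closeChunk(events, writer, os, root.getName());
            fileIndex++;
        }
        return fileIndex;
    }

    private static void closeChunk(XMLEventFactory events, XMLEventWriter writer,
                                   OutputStream os, QName rootName)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement(
                rootName.getPrefix(), rootName.getNamespaceURI(), rootName.getLocalPart()));
        writer.add(events.createEndDocument());
        writer.flush();
        writer.close();  // does not close the underlying stream, so close it explicitly
        os.close();
    }
}
```

A call such as EmployeeChunkSplitter.split(inputPath, outputDir, 100) would then produce files holding up to 100 Employee elements each. Because the input is read as a stream of events, memory use should not grow with the size of the input file.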
Simmant answered Nov 12 '22 00:11


I hope I understood this correctly: you only need to increment count each time you add one employee, and start a new file once it reaches the limit.

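        // Without this, the Transformer writes an XML declaration before every element
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        File file = new File("out" + i + ".xml");
        FileOutputStream fos = new FileOutputStream(file, true);
        appendStuff("<Employees>", file);
        while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            count++;
            t.transform(new StAXSource(xsr), new StreamResult(fos));
            if (count == 100) {
                // Chunk is full: close it and start the next file
                count = 0;
                i++;
                appendStuff("</Employees>", file);
                fos.close();
                file = new File("out" + i + ".xml");
                fos = new FileOutputStream(file, true);
                appendStuff("<Employees>", file);
            }
        }
        // Close the last, partially filled chunk
        appendStuff("</Employees>", file);
        fos.close();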

It's not very nice, but you get the idea. The appendStuff helper:

private static void appendStuff(String content, File file) throws IOException {
    // Open in append mode; try-with-resources flushes and closes the writer
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
        bw.write(content);
    }
}
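
One way to check the result is to re-read each output file with a streaming parser and count the direct children of the root element. The following is a minimal sketch; the names ChunkChecker and countRecords are hypothetical. Since the file is parsed fully, a chunk that is not well-formed (for example, one with a missing closing root tag) should surface as an XMLStreamException instead of a count.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical verification helper; names are assumptions.
final class ChunkChecker {

    /** Returns the number of elements that are direct children of the root element. */
    static int countRecords(Path chunk) throws IOException, XMLStreamException {
        int depth = 0;
        int records = 0;
        try (InputStream in = Files.newInputStream(chunk)) {
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (xsr.hasNext()) {
                int event = xsr.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        records++;   // a start tag directly below the root
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            xsr.close();
        }
        return records;
    }
}
```

Running countRecords over every outN.xml file and comparing the result with the chunk size (100 in the answer above) is a quick sanity check before the chunks are handed to another system.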
Kev answered Nov 11 '22 22:11