
How to improve XML file splitting performance

Tags:

java

xml

I've seen quite a lot of posts/blogs/articles about splitting an XML file into smaller chunks and decided to write my own splitter because I have some custom requirements. Here is what I mean; consider the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?> 
<company>
  <staff id="1">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
  </staff>
  <staff id="2">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
  </staff>
  <staff id="3">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
  </staff>
  <staff id="4">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
  </staff>
  <staff id="5">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <salary>100000</salary>
  </staff>
</company>

I want to split this XML into n files, each containing one staff element, but a staff element is only kept if it contains a nickname; if the nickname is missing I don't want it. For the example above this should produce 4 split files, containing the staff elements with ids 1 through 4.

Here is my code :

public int split() throws Exception{
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFilePath)));

        String line;
        List<String> tempList = null;

        while((line=br.readLine())!=null){
            if(line.contains("<?xml version=\"1.0\"") || line.contains("<" + rootElement + ">") || line.contains("</" + rootElement + ">")){
                continue;
            }

            if(line.contains("<"+ element +">")){
                tempList = new ArrayList<String>();
            }
            tempList.add(line);

            if(line.contains("</"+ element +">")){
                if(hasConditions(tempList)){
                    writeToSplitFile(tempList);
                    writtenObjectCounter++;
                    totalCounter++;
                }
            }

            if(writtenObjectCounter == itemsPerFile){
                writtenObjectCounter = 0;
                fileCounter++;          
                tempList.clear();
            }
        }

        if(tempList.size() != 0){
            writeClosingRootElement();
        }

        return totalCounter;
    }

    private void writeToSplitFile(List<String> itemList) throws Exception{
        BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
        if(writtenObjectCounter == 0){
            wr.write("<" + rootElement + ">");
            wr.write("\n");
        }

        for (String string : itemList) {
            wr.write(string);
            wr.write("\n");
        }

        if(writtenObjectCounter == itemsPerFile-1)
            wr.write("</" + rootElement + ">");
        wr.close();
    }

    private void writeClosingRootElement() throws Exception{
        BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
        wr.write("</" + rootElement + ">");
        wr.close();
    }

    private boolean hasConditions(List<String> list){
        int matchList = 0;

        for (String condition : conditionList) {
            for (String string : list) {
                if(string.contains(condition)){
                    matchList++;
                }
            }
        }

        if(matchList >= conditionList.size()){
            return true;
        }

        return false;
    }

I know that opening and closing the stream for every written staff element hurts performance. The alternative would be to write once per file (each of which may contain n staff elements). Naturally, the root and split elements are configurable.

Any ideas how I can improve the performance/logic? I'd prefer some code, but good advice can sometimes be better.

Edit:

This XML is actually a dummy example. The real XML I'm trying to split has about 300-500 different elements under the split element, all appearing in random order and in varying numbers. StAX may not be the best solution after all?

Bounty update :

I'm looking for a solution(code) that will:

  • Be able to split an XML file into n parts with x split elements each (in the dummy XML example, staff is the split element).

  • The content of the split files should be wrapped in the root element from the original file (company in the dummy example).

  • I'd like to be able to specify conditions that the split element must satisfy, i.e. I want only staff that have a nickname and want to discard those without one. It should also be possible to run the split without any conditions.

  • The code doesn't necessarily have to improve my solution (which lacks good logic and performance, but works).

I'm not happy with "but it works". I also can't find enough examples of StAX for this kind of operation, and the user community is not great either. It doesn't have to be a StAX solution.

I'm probably asking too much, but I'm here to learn, and I think I'm offering a good bounty for the solution.

Gandalf StormCrow asked Sep 13 '11 21:09

1 Answer

First piece of advice: don't try to write your own XML handling code. Use an XML parser - it's going to be much more reliable and quite possibly faster.
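
To illustrate the reliability point, here is a minimal sketch comparing a substring test of the kind used in the question with a parser-based check. The class and method names are hypothetical and exist only for this illustration. The substring test misses a start tag as soon as it carries an attribute, while the parser reports the element name regardless of attributes or line layout.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of the question's code.
class StartTagCheck {

    // Line-based approach: plain substring test on a raw line of text.
    static boolean matchesByString(String line, String element) {
        return line.contains("<" + element + ">");
    }

    // Parser-based approach: is the first element of the fragment named `element`?
    static boolean matchesByParser(String xml, String element) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // skips whitespace and moves to the first start tag
            return reader.getLocalName().equals(element);
        } finally {
            reader.close();
        }
    }
}
```

For `<staff id="1">` the substring test with element `staff` returns false, because the text `<staff>` never occurs, whereas the parser-based check on a complete staff element returns true.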

If you use an XML pull parser (e.g. StAX) you should be able to read an element at a time and write it out to disk, never reading the whole document in one go.
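
A minimal sketch of that idea, using the StAX event API that ships with the JDK. It is one possible shape under stated assumptions, not a definitive implementation: the class name `StaxSplitter`, the method signature and the `split_N.xml` naming (borrowed from the question) are illustrative, and a "condition" is interpreted here as "the split element contains a descendant element with this name". Only one split element is buffered at a time, and each output file is opened once instead of once per element.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the input once, buffering one split element at a time.
class StaxSplitter {
    private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

    // Returns the number of split elements written.
    // An empty requiredChildren set means "no conditions".
    static int split(Path input, Path outputDir, String splitElement,
                     Set<String> requiredChildren, int itemsPerFile)
            throws IOException, XMLStreamException {
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        StartElement root = null;   // first start tag of the input, reused as wrapper
        OutputStream out = null;
        XMLEventWriter writer = null;
        int total = 0, inFile = 0, fileCounter = 0;
        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) continue;
                StartElement start = event.asStartElement();
                if (root == null) { root = start; continue; }
                if (!start.getName().getLocalPart().equals(splitElement)) continue;

                // Buffer the whole split element and note which required names appear.
                List<XMLEvent> buffer = new ArrayList<>();
                Set<String> missing = new HashSet<>(requiredChildren);
                buffer.add(event);
                for (int depth = 1; depth > 0; ) {
                    XMLEvent e = reader.nextEvent();
                    if (e.isStartElement()) {
                        depth++;
                        missing.remove(e.asStartElement().getName().getLocalPart());
                    } else if (e.isEndElement()) {
                        depth--;
                    }
                    buffer.add(e);
                }
                if (!missing.isEmpty()) continue; // condition not met: discard

                if (writer == null) { // open the next output file lazily
                    out = Files.newOutputStream(outputDir.resolve("split_" + fileCounter + ".xml"));
                    writer = outputFactory.createXMLEventWriter(out, "UTF-8");
                    writer.add(EVENTS.createStartDocument("UTF-8", "1.0"));
                    writer.add(root);
                }
                for (XMLEvent e : buffer) writer.add(e);
                total++;
                if (++inFile == itemsPerFile) { // file is full: close it
                    finish(writer, out, root.getName());
                    writer = null; out = null; inFile = 0; fileCounter++;
                }
            }
            if (writer != null) { // close a partially filled last file
                finish(writer, out, root.getName());
                out = null;
            }
        } finally {
            if (out != null) out.close(); // only reached with an open file on error
        }
        return total;
    }

    // Writes the closing root tag and releases the writer and its stream.
    private static void finish(XMLEventWriter writer, OutputStream out, QName root)
            throws IOException, XMLStreamException {
        writer.add(EVENTS.createEndElement(root.getPrefix(), root.getNamespaceURI(), root.getLocalPart()));
        writer.add(EVENTS.createEndDocument());
        writer.close(); // does not close the underlying stream
        out.close();
    }
}
```

With the sample document from the question, `StaxSplitter.split(input, outputDir, "staff", Set.of("nickname"), 1)` would write `split_0.xml` through `split_3.xml` and skip the staff element without a nickname. Because the element name comes from the parser, the order and number of child elements inside the split element do not matter.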

Jon Skeet answered Sep 30 '22 17:09