
Using SAX parser on xml file inside a zip

This may be beyond the capabilities of the Java VM, given the size of the files involved (50-100 MB XML files).

Right now I receive a set of XML files as zips. The zips are all decompressed first, and then every XML file in the directory is processed one at a time using SAX.

To save time and space (the compression ratio is about 1:10), I was wondering whether there is a way to pass a ZipEntry that is an XML file directly to a SAX handler.

I've seen it done using DocumentBuilder and other XML parsing methods, but for performance (and especially memory) I'm sticking with SAX.

Currently I am using SAX in the following way:

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        MyHandler handler = new MyHandler();

        for( String curFile : xmlFiles )
        {
            System.out.println( "\n\n\t>>>>> open " + curFile + " <<<<<\n");
            saxParser.parse( "file://" + new File( dirToProcess + curFile ).getAbsolutePath(), handler );
        }
asked Sep 13 '12 16:09 by FaultyJuggler


2 Answers

You can parse XML using an InputStream as the source. So you can open a ZipFile, get the InputStream of the entry you want, and parse it directly, without extracting anything to disk. See the ZipFile.getInputStream method.

---- Edit ----

Here is some code to guide you:

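for( String curFile : xmlFiles )   // each name is now expected to point at a zip archive
{
        // try-with-resources closes the archive even if parsing throws
        try (ZipFile zip = new ZipFile(new File( dirToProcess + curFile))) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()){
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) continue;   // directory entries carry no XML
                try (InputStream xmlStream = zip.getInputStream(entry)) {
                    saxParser.parse( xmlStream, handler );
                }
            }
        }
}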
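
For anyone who wants to try this without a real archive at hand, here is a minimal self-contained sketch of the same idea. It is only an illustration under stated assumptions: the class `ZipSaxDemo`, the `CountingHandler`, the helper `writeSampleZip()` and the `.xml` name filter are all hypothetical and not part of the question's code. It builds a small temporary zip, then feeds each XML entry to SAX through `ZipFile.getInputStream`.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZipSaxDemo {

    /** Hypothetical handler that only counts start tags. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses every *.xml entry of the archive and returns the total number of elements seen. */
    static int countElements(File zipPath) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Assumption: XML entries can be recognised by their file extension.
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) continue;
                // The entry is decompressed on the fly while the parser reads it.
                try (InputStream in = zip.getInputStream(entry)) {
                    parser.parse(in, handler);
                }
            }
        }
        return handler.elements;
    }

    /** Builds a small throwaway archive so the sketch can run on its own. */
    static File writeSampleZip() throws Exception {
        File f = File.createTempFile("sax-demo", ".zip");
        f.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(f))) {
            out.putNextEntry(new ZipEntry("a.xml"));
            out.write("<root><item/><item/></root>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("notes.txt"));
            out.write("not xml".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.xml"));
            out.write("<root><item/></root>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return f;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(writeSampleZip()));
    }
}
```

Since SAX pulls bytes from the stream as it goes, memory use should depend mostly on what the handler keeps, not on the uncompressed size of the entry.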
answered Sep 28 '22 03:09 by Gilberto Torrezan


  • ZipInputStream.read() reads a number of bytes from the current ZipEntry, decompresses them, and returns the decompressed bytes.
  • Use any of the methods here to create a connected in/out stream pair.
  • Pass the input side of that pair to your parser as its InputStream.
  • Start writing the unzipped data to the output side of the pair (the OutputStream).
  • You are now reading chunks of data from the zip file, unzipping them, and passing them to the parser.

PS:

  1. If the zip file contains multiple files, see this: extracting contents of ZipFile entries when read from byte[] (Java). You'll have to add a check so that you know when you reach the end of an entry.
  2. I don't know much about the SAX parser, but I assume it can parse the file in this manner (when fed in chunks).
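
To make point 1 concrete, here is a rough sketch of reading several entries from a single ZipInputStream and handing each one to SAX. ZipInputStream.read() returns -1 at the end of the current entry, so the parser sees a normal end of stream per entry. The one catch assumed here is that the parser may close the stream it was given once it finishes, which would also close the archive, so the sketch wraps it in a stream whose close() does nothing. The names `ZipStreamSaxDemo`, `NonClosingStream`, `rootElements` and `sampleZip` are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZipStreamSaxDemo {

    /** Hypothetical wrapper: ignores close() so the parser cannot close the archive stream. */
    static class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) { super(in); }

        @Override
        public void close() { /* left open on purpose; the caller closes the ZipInputStream */ }
    }

    /** Returns "entryName:rootElement" for each file entry read from the raw zip bytes. */
    static List<String> rootElements(InputStream rawZip) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        List<String> roots = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(rawZip)) {
            ZipEntry entry;
            // getNextEntry() positions the stream at the start of the next entry's data.
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                final String name = entry.getName();
                parser.parse(new NonClosingStream(zin), new DefaultHandler() {
                    boolean seenRoot = false;

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes a) {
                        // Record only the first start tag, which is the document's root element.
                        if (!seenRoot) {
                            roots.add(name + ":" + qName);
                            seenRoot = true;
                        }
                    }
                });
            }
        }
        return roots;
    }

    /** Builds a two-entry archive in memory so the sketch runs without any files. */
    static byte[] sampleZip() throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("first.xml"));
            out.write("<alpha><x/></alpha>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("second.xml"));
            out.write("<beta/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootElements(new ByteArrayInputStream(sampleZip())));
    }
}
```

This variant needs no extra thread or pipe, and it can be useful when the zip arrives as a stream (for example over a socket) and is never stored as a file.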

--- edit ---

Here is what I meant:

import java.io.File;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {
    static class MyRunnable implements Runnable {

        private InputStream xmlStream;
        private SAXParser sParser;

        public MyRunnable(SAXParser p, InputStream is) {
            sParser = p;
            xmlStream = is;
        }

        public void run() {
            try {
                sParser.parse(xmlStream, new DefaultHandler() {
                    public void startElement(String uri, String localName, String qName, Attributes attributes)
                            throws SAXException {
                        System.out.println("\nStart Element :" + qName);
                    }

                    public void endElement(String uri, String localName, String qName) throws SAXException {
                        System.out.println("\nEnd Element :" + qName);
                    }
                });
                System.out.println("Done parsing..");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

    }

    final static int BUF_SIZE = 5;
    public static void main(String argv[]) {

        try {

            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

            ZipFile zip = new ZipFile(new File("D:\\Workspaces\\Indigo\\Test\\performance.zip"));
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                // in stream for parser..
                PipedInputStream xmlStream = new PipedInputStream();
                // out stream attached to in stream above.. we would read from zip file and write to this..
                // thus passing whatever we write to the parser..
                PipedOutputStream out = new PipedOutputStream(xmlStream);
                // Parser blocks in in stream, so put him on a different thread..
                Thread parserThread = new Thread(new Main.MyRunnable(saxParser, xmlStream));
                parserThread.start();

                ZipEntry entry = entries.nextElement();
                System.out.println("\nOpening zip entry: " + entry.getName());
                InputStream unzippedStream = zip.getInputStream(entry);

                byte buf[] = new byte[BUF_SIZE]; int bytesRead = 0;
                while ((bytesRead = unzippedStream.read(buf)) > 0) {
                    // write to err for different color in eclipse..
                    System.err.write(buf, 0, bytesRead);
                    out.write(buf, 0, bytesRead);
                    Thread.sleep(150); // theatrics...
                }

                // closing the write end of the pipe signals end-of-stream to the parser..
                out.close();
                // wait for the parser to finish before the shared SAXParser is reused for the next entry..
                parserThread.join();

                unzippedStream.close(); xmlStream.close();
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }
}
answered Sep 28 '22 02:09 by Kashyap