Using a SAX parser on an XML file inside a zip

This may be beyond the capabilities of the Java VM because of the size of the files involved (50-100MB XML files).

Right now I receive a set of XML files as zips. The zips are all decompressed first, and then every XML file in the directory is processed one at a time using SAX.

To save time and space (since the compression is about 1:10) I was wondering if there is a way to pass a ZipEntry that is an XML file to a SAX handler, without unpacking it to disk first.

I've seen it done using DocumentBuilder and other XML parsing methods, but for performance (and especially memory) I'm sticking with SAX.

Currently I am using SAX in the following way:

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        MyHandler handler = new MyHandler();

        for( String curFile : xmlFiles )
        {
            System.out.println( "\n\n\t>>>>> open " + curFile + " <<<<<\n");
            saxParser.parse( "file://" + new File( dirToProcess + curFile ).getAbsolutePath(), handler );
        }

2 Answers

You can parse XML using an InputStream as the source. So you can open a ZipFile, get the InputStream of the entry you want, and then parse it. See the ZipFile.getInputStream(ZipEntry) method.

---- Edit ----

Here is some code to guide you:

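for( String curFile : xmlFiles )
{
        // curFile is now the name of a zip file rather than an unpacked XML file
        ZipFile zip = new ZipFile(new File( dirToProcess + curFile));
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()){
            ZipEntry entry = entries.nextElement();
            // the entry is decompressed on the fly as the parser reads from the stream
            InputStream xmlStream = zip.getInputStream(entry);
            saxParser.parse( xmlStream, handler );
            xmlStream.close();
        }
        zip.close();
}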
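
For a version of the loop above that compiles on its own, here is a minimal sketch. The class name ZipSaxDemo, the CountingHandler and the ".xml" name filter are my own assumptions for illustration and are not part of the original code. It uses try-with-resources so the zip and each entry stream are closed even if parsing fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ZipSaxDemo {

    /** Hypothetical handler for illustration: it only counts start tags. */
    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            elements++;
        }
    }

    /** Parses every .xml entry in the zip and returns entry name -> number of elements. */
    public static Map<String, Integer> countElements(String zipPath)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Assumption: only entries named *.xml should be parsed.
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) {
                    continue;
                }
                CountingHandler handler = new CountingHandler();
                // The entry is decompressed lazily while the parser reads from the stream.
                try (InputStream xmlStream = zip.getInputStream(entry)) {
                    saxParser.parse(xmlStream, handler);
                }
                counts.put(entry.getName(), handler.elements);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ZipSaxDemo <file.zip>");
            return;
        }
        System.out.println(countElements(args[0]));
    }
}
```

Since SAX only keeps the current event in memory and the zip stream only buffers a small window, memory use should stay roughly flat regardless of the size of the XML, but I have not measured that on 50-100MB files.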

  • ZipInputStream.read() reads up to the requested number of bytes from the current zip entry, decompresses them and gives you the unzipped bytes.
  • Create a connected in/out stream pair (for example a PipedInputStream attached to a PipedOutputStream).
  • Give the input side of that pair to your parser as its InputStream.
  • Start writing the unzipped data to the output side of the pair.
  • So you're now reading chunks of data from the zip file, unzipping them and passing them to the parser.


  1. If the zip file contains multiple files, see "extracting contents of ZipFile entries when read from byte[] (Java)". You'll have to put in a check so that you know when you have reached the end of an entry.
  2. I don't know much about the SAX parser, but I assume it can parse a file in this manner (when it is given the data in chunks).
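
If all you have is the zip as a stream, the end-of-entry check from point 1 can be sketched like this. ZipInputStream.read() returns -1 at the end of the current entry, so the parser sees each entry as one complete document, and getNextEntry() moves on to the next one. Some parser implementations close the stream they are handed, which would also close the zip stream, so this sketch wraps it in a stream whose close() does nothing. The names ZipStreamSaxDemo, NonClosingStream and rootElements are made up for the example.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZipStreamSaxDemo {

    /** Hypothetical wrapper: close() is a no-op so the parser cannot close the zip stream. */
    static class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // intentionally empty; the ZipInputStream is closed by its owner
        }
    }

    /** Parses every entry of the zip stream and returns "entryName:rootElement" for each. */
    public static List<String> rootElements(InputStream rawZip) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        List<String> roots = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(rawZip)) {
            ZipEntry entry;
            // getNextEntry() returns null once there are no more entries.
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                final String name = entry.getName();
                // read() on zin reports end of stream at the end of this entry only.
                saxParser.parse(new NonClosingStream(zin), new DefaultHandler() {
                    private boolean seenRoot;

                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes atts) {
                        if (!seenRoot) {
                            roots.add(name + ":" + qName);
                            seenRoot = true;
                        }
                    }
                });
            }
        }
        return roots;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ZipStreamSaxDemo <file.zip>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(rootElements(in));
        }
    }
}
```

With this approach no second thread or piped stream is needed, because the parser pulls the bytes from the zip stream itself.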

--- edit ---

Here is what I meant:

import java.io.File;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {
    static class MyRunnable implements Runnable {

        private InputStream xmlStream;
        private SAXParser sParser;

        public MyRunnable(SAXParser p, InputStream is) {
            sParser = p;
            xmlStream = is;
        }

        public void run() {
            try {
                sParser.parse(xmlStream, new DefaultHandler() {
                    public void startElement(String uri, String localName, String qName, Attributes attributes)
                            throws SAXException {
                        System.out.println("\nStart Element :" + qName);
                    }

                    public void endElement(String uri, String localName, String qName) throws SAXException {
                        System.out.println("\nEnd Element :" + qName);
                    }
                });
                System.out.println("Done parsing..");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    final static int BUF_SIZE = 5;
    public static void main(String argv[]) {

        try {

            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

            ZipFile zip = new ZipFile(new File("D:\\Workspaces\\Indigo\\Test\\performance.zip"));
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                // in stream for parser..
                PipedInputStream xmlStream = new PipedInputStream();
                // out stream attached to in stream above.. we would read from zip file and write to this..
                // thus passing whatever we write to the parser..
                PipedOutputStream out = new PipedOutputStream(xmlStream);
                // Parser blocks in in stream, so put him on a different thread..
                Thread parserThread = new Thread(new Main.MyRunnable(saxParser, xmlStream));
                parserThread.start();

                ZipEntry entry = entries.nextElement();
                System.out.println("\nOpening zip entry: " + entry.getName());
                InputStream unzippedStream = zip.getInputStream(entry);

                byte buf[] = new byte[BUF_SIZE]; int bytesRead = 0;
                while ((bytesRead = unzippedStream.read(buf)) > 0) {
                    // write to err for different color in eclipse..
                    System.err.write(buf, 0, bytesRead);
                    out.write(buf, 0, bytesRead);
                    Thread.sleep(150); // theatrics...
                }

                // closing the out stream tells the parser there is no more input..
                out.close();
                // wait for the parser to finish this entry before the parser is reused..
                parserThread.join();

                unzippedStream.close(); xmlStream.close();
            }
            zip.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

