This may be beyond the capabilities of the Java VM given the size of the files involved (50-100 MB XML files).
Right now I have a set of XML files sent as zips. Each zip is decompressed, and then every XML file in the directory is processed one at a time using SAX.
To save time and space (since the compression is about 1:10), I was wondering whether there is a way to pass a ZipEntry that is an XML file directly to a SAX handler.
I've seen it done using DocumentBuilder and other XML parsing methods, but for performance (and especially memory) I'm sticking with SAX.
Currently I am using SAX in the following way:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
MyHandler handler = new MyHandler();
for (String curFile : xmlFiles) {
    System.out.println("\n\n\t>>>>> open " + curFile + " <<<<<\n");
    saxParser.parse("file://" + new File(dirToProcess + curFile).getAbsolutePath(), handler);
}
The ContentHandler interface requires a number of methods that the SAX parser invokes in response to various parsing events. The major event-handling methods are startDocument, endDocument, startElement, and endElement. The easiest way to implement this interface is to extend the DefaultHandler class, defined in the org.xml.sax.helpers package.
The Simple API for XML (SAX) is an event-based API that uses callback routines, or event handlers, to process the different parts of an XML document. To use SAX, one registers handlers for the different events and then parses the document.
The important methods to override are startElement(), endElement(), and characters(). When the SAXParser encounters a start element while parsing the document, it calls startElement(). Overriding this method lets us set boolean variables that identify the current element.
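As a rough illustration of the override pattern just described, here is a minimal sketch of a handler that uses a boolean flag to collect element text. The class names TitleHandlerDemo and TitleHandler, the helper parseTitles, and the <title> element are all made up for the example; only DefaultHandler and the three callbacks come from the SAX API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandlerDemo {

    // Hypothetical handler: collects the text content of every <title> element.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private boolean inTitle = false; // set in startElement, cleared in endElement
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("title".equals(qName)) {
                inTitle = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so append.
            if (inTitle) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString());
                inTitle = false;
            }
        }
    }

    // Parses an XML string and returns the collected <title> texts.
    static List<String> parseTitles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>SAX</title></book><book><title>Zip</title></book></books>";
        System.out.println(parseTitles(xml));
    }
}
```

The handler keeps only the current element's text in memory, which is the property that makes SAX attractive for 50-100 MB inputs.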
You can parse XML using an InputStream as the source. So you can open a ZipFile, get the InputStream of the entry you want, and then parse it. See the ZipFile.getInputStream method.
---- Edit ----
Here is some code to guide you:
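// xmlFiles now holds the names of the zip archives rather than extracted XML files
for (String curFile : xmlFiles) {
    // try-with-resources closes the archive even if parsing fails
    try (ZipFile zip = new ZipFile(new File(dirToProcess + curFile))) {
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            // skip directory entries; they have no XML content to parse
            if (entry.isDirectory()) {
                continue;
            }
            // the stream yields decompressed bytes, so nothing is written to disk
            try (InputStream xmlStream = zip.getInputStream(entry)) {
                saxParser.parse(xmlStream, handler);
            }
        }
    }
}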
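ZipInputStream.read() reads x number of bytes from the zip entry, unzips them, and gives you the unzipped bytes. Create a connected pair of streams, give the input end to your parser as its InputStream, and write the unzipped bytes into the other end (the OutputStream).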
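For what it's worth, a ZipInputStream can also be handed to the parser directly, one entry at a time, because its read() reports end-of-stream at the end of the current entry. The sketch below assumes the parser may close the stream it is given, so it wraps the ZipInputStream in a FilterInputStream whose close() does nothing. The names ZipSaxDemo, countElements, and sampleZip, and the in-memory sample archive, are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZipSaxDemo {

    // Streams a zip archive and counts the start tags in each .xml entry.
    static Map<String, Integer> countElements(InputStream zipBytes) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipBytes)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) {
                    continue;
                }
                final int[] n = {0};
                // Shield the archive stream: a close() from the parser must not
                // close the whole ZipInputStream before the remaining entries are read.
                InputStream entryStream = new FilterInputStream(zin) {
                    @Override
                    public void close() {
                        // intentionally empty
                    }
                };
                parser.parse(entryStream, new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes a) {
                        n[0]++;
                    }
                });
                counts.put(entry.getName(), n[0]);
            }
        }
        return counts;
    }

    // Builds a small two-entry archive in memory (sample data for the demo only).
    static byte[] sampleZip() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry("a.xml"));
            zout.write("<r><x/><x/></r>".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("b.xml"));
            zout.write("<r/>".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(new ByteArrayInputStream(sampleZip())));
    }
}
```

In real use the ByteArrayInputStream would be replaced by a FileInputStream (or a network stream) over the received zip. This variant needs no second thread, at the cost of reading the entries strictly in archive order.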
--- edit ---
Here is what I meant:
import java.io.File;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {

    // Runs the SAX parse on its own thread, reading from the piped input stream.
    static class MyRunnable implements Runnable {
        private final InputStream xmlStream;
        private final SAXParser sParser;

        public MyRunnable(SAXParser p, InputStream is) {
            sParser = p;
            xmlStream = is;
        }

        @Override
        public void run() {
            try {
                sParser.parse(xmlStream, new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes attributes)
                            throws SAXException {
                        System.out.println("\nStart Element :" + qName);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) throws SAXException {
                        System.out.println("\nEnd Element :" + qName);
                    }
                });
                System.out.println("Done parsing..");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    // deliberately tiny so the interleaving of zip reads and parser events is visible
    static final int BUF_SIZE = 5;

    public static void main(String[] argv) {
        // try-with-resources closes the archive when we are done
        try (ZipFile zip = new ZipFile(new File("D:\\Workspaces\\Indigo\\Test\\performance.zip"))) {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // in stream for parser..
                PipedInputStream xmlStream = new PipedInputStream();
                // out stream attached to in stream above.. we read from the zip file and write to this..
                // thus passing whatever we write to the parser..
                PipedOutputStream out = new PipedOutputStream(xmlStream);
                // Parser blocks on the in stream, so put it on a different thread..
                Thread parserThread = new Thread(new MyRunnable(saxParser, xmlStream));
                parserThread.start();
                System.out.println("\nOpening zip entry: " + entry.getName());
                InputStream unzippedStream = zip.getInputStream(entry);
                byte[] buf = new byte[BUF_SIZE];
                int bytesRead;
                while ((bytesRead = unzippedStream.read(buf)) > 0) {
                    // write to err for different color in eclipse..
                    System.err.write(buf, 0, bytesRead);
                    out.write(buf, 0, bytesRead);
                    Thread.sleep(150); // theatrics... remove for real use
                }
                // closing the out stream is what signals end-of-document to the parser
                out.close();
                // wait for the parser to finish this entry before reusing it for the next one
                parserThread.join();
                unzippedStream.close();
                xmlStream.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}