
Java SAX parser progress monitoring

I'm writing a SAX parser in Java to parse a 2.5GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?

Danijel asked Jun 23 '10 08:06




4 Answers

Thanks to EJP's suggestion of ProgressMonitorInputStream, I ended up extending FilterInputStream so that a ChangeListener can monitor the current read position in bytes.

This gives you finer control, for example to show multiple progress bars when reading several big XML files in parallel, which is exactly what I did.

So, a simplified version of the monitorable stream:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import javax.swing.event.ChangeEvent;
import javax.swing.event.ChangeListener;

/**
 * A class that monitors the read progress of an input stream.
 *
 * @author Hermia Yeung "Sheepy"
 * @since 2012-04-05 18:42
 */
public class MonitoredInputStream extends FilterInputStream {
   private volatile long mark = 0;
   private volatile long lastTriggeredLocation = 0;
   private volatile long location = 0;
   private final int threshold;
   private final List<ChangeListener> listeners = new ArrayList<>(4);


   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * @param in Underlying input stream, should be non-null because of no public setter
    * @param threshold Min. position change (in byte) to trigger change event.
    */
   public MonitoredInputStream(InputStream in, int threshold) {
      super(in);
      this.threshold = threshold;
   }

   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * Default threshold is 16KB; a small threshold may hurt performance on larger streams.
    * @param in Underlying input stream, should be non-null because of no public setter
    */
   public MonitoredInputStream(InputStream in) {
      super(in);
      this.threshold = 1024*16;
   }

   public void addChangeListener(ChangeListener l) { if (!listeners.contains(l)) listeners.add(l); }
   public void removeChangeListener(ChangeListener l) { listeners.remove(l); }
   public long getProgress() { return location; }

   protected void triggerChanged( final long location ) {
      if ( threshold > 0 && Math.abs( location-lastTriggeredLocation ) < threshold ) return;
      lastTriggeredLocation = location;
      if (listeners.size() <= 0) return;
      try {
         final ChangeEvent evt = new ChangeEvent(this);
         for (ChangeListener l : listeners) l.stateChanged(evt);
      } catch (ConcurrentModificationException e) {
         triggerChanged(location);  // List changed? Let's re-try.
      }
   }


   @Override public int read() throws IOException {
      final int i = super.read();
      if ( i != -1 ) triggerChanged( ++location ); // count the byte before notifying listeners
      return i;
   }

   @Override public int read(byte[] b, int off, int len) throws IOException {
      final int i = super.read(b, off, len);
      if ( i > 0 ) triggerChanged( location += i );
      return i;
   }

   @Override public long skip(long n) throws IOException {
      final long i = super.skip(n);
      if ( i > 0 ) triggerChanged( location += i );
      return i;
   }

   @Override public void mark(int readlimit) {
      super.mark(readlimit);
      mark = location;
   }

   @Override public void reset() throws IOException {
      super.reset();
      if ( location != mark ) triggerChanged( location = mark );
   }
}

It doesn't know - or care - how big the underlying stream is, so you need to get it some other way, such as from the file itself.
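
For the "some other way" part, here is a minimal sketch of turning the byte position into a percentage. The class and method names (ReadProgress, percent, totalBytes) are made up for illustration and are not part of the stream above. It does the arithmetic in long, because a 2.5GB file does not fit in an int.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: converts a byte position into a percentage of a known total. */
class ReadProgress {

    /** Returns a value in 0..100; long arithmetic avoids int overflow above 2 GB. */
    static int percent(long bytesRead, long totalBytes) {
        if (totalBytes <= 0) return 0;              // unknown or empty source
        long p = bytesRead * 100 / totalBytes;
        return (int) Math.max(0, Math.min(100, p)); // clamp in case of over-reads
    }

    /** Total size taken from the file system, since the stream itself cannot tell. */
    static long totalBytes(Path file) throws IOException {
        return Files.size(file);
    }
}
```

A listener could then call something like ReadProgress.percent(mis.getProgress(), total) and hand the result to a progress bar with a fixed 0..100 range.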

So, here goes the simplified sample usage:

try (
   MonitoredInputStream mis = new MonitoredInputStream(new FileInputStream(file), 65536*4) 
) {

   // Set up max progress and a listener to monitor read progress.
   // Values are scaled to KB so that files over 2 GB still fit in an int.
   progressBar.setMaxProgress( (int) (file.length() / 1024) ); // Swing thread or before display please
   mis.addChangeListener( new ChangeListener() { @Override public void stateChanged(ChangeEvent e) {
      SwingUtilities.invokeLater( new Runnable() { @Override public void run() {
         progressBar.setProgress( (int) (mis.getProgress() / 1024) ); // Promise me you WILL use MVC instead of this anonymous class mess! 
      }});
   }});
   // Start parsing. Listener would call Swing event thread to do the update.
   SAXParserFactory.newInstance().newSAXParser().parse(mis, this);

} catch ( IOException | ParserConfigurationException | SAXException e) {

   e.printStackTrace();

} finally {

   progressBar.setVisible(false); // Again please call this in swing event thread

}

In my case the progress bar advances smoothly from left to right without abrupt jumps. Adjust the threshold to balance performance against responsiveness: too small, and the reading time can more than double on small devices; too big, and the progress will not be smooth.

Hope it helps. Feel free to edit if you find mistakes or typos, or vote up to send me some encouragement! :D

Sheepy answered Oct 15 '22 02:10


Use a javax.swing.ProgressMonitorInputStream.
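
A minimal sketch of how that might be wired into a SAX parse. The class name ProgressMonitorParse, the method name and the "Parsing XML" message are placeholders of mine. As far as I know, the stream sizes its monitor from available(), which returns an int, so for a file over 2 GB the bar may not be accurate.

```java
import java.awt.Component;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.ProgressMonitorInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical wrapper: parses a stream while Swing's built-in monitor tracks the bytes read. */
class ProgressMonitorParse {

    /**
     * Parses raw through a ProgressMonitorInputStream.
     * parent is the component the progress dialog is centred on; it may be null.
     */
    static void parse(Component parent, InputStream raw, DefaultHandler handler)
            throws IOException, ParserConfigurationException, SAXException {
        // The monitor decides by itself whether the read is slow enough to show a dialog.
        try (InputStream in = new BufferedInputStream(
                new ProgressMonitorInputStream(parent, "Parsing XML", raw))) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
    }
}
```

Run the parse on a worker thread rather than the Swing event thread, otherwise the dialog cannot repaint. If the user presses Cancel, the next read throws an InterruptedIOException, which reaches the caller as an IOException.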

user207421 answered Oct 15 '22 01:10


You can get an estimate of the current line/column in your file by overriding the method setDocumentLocator of org.xml.sax.helpers.DefaultHandler/BaseHandler. This method is called with an object from which you can get an approximation of the current line/column when needed.

Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.
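
A minimal sketch of the locator approach, assuming the parser actually supplies a Locator (it is allowed not to). LineTrackingHandler and lastLineOf are names I made up for the example. A line number only becomes a progress figure if you know the total number of lines beforehand.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that remembers the line of the most recent start tag. */
class LineTrackingHandler extends DefaultHandler {
    private Locator locator;      // supplied by the parser; stays null if it provides none
    private int lastLine = -1;

    @Override public void setDocumentLocator(Locator locator) {
        this.locator = locator;   // called once, before startDocument
    }

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes attributes) {
        // The locator is only valid inside callbacks, so read it here.
        if (locator != null) lastLine = locator.getLineNumber();
    }

    int getLastLine() { return lastLine; }

    /** Convenience for trying it out on an in-memory document. */
    static int lastLineOf(String xml) throws Exception {
        LineTrackingHandler handler = new LineTrackingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getLastLine();
    }
}
```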

Po' Lazarus answered Oct 15 '22 02:10


Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.

public void startElement (String uri, String localName, 
                          String qName, Attributes attributes) 
                          throws SAXException {
    if(qName.equals("article")){
        counter++;
    }
    ...
}

(I don't know whether you are parsing "article", it's just an example)

If you don't know the number of articles in advance, you will need to count them first. Then you can print the status as tags read / total tags, say every 100 tags (counter % 100 == 0).

You could even have another thread monitor the progress. In that case you might want to synchronize access to the counter, although that is not strictly necessary since the figure doesn't need to be exact.
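
A rough sketch of the counter plus a monitoring thread, using an AtomicLong so no explicit synchronization is needed. CountingHandler, parseAndReport and the reporting period are all just illustrative choices, and "article" stands in for whatever element you actually count.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that counts start tags with a given qualified name. */
class CountingHandler extends DefaultHandler {
    private final String elementName;
    private final AtomicLong count = new AtomicLong();   // safe to read from another thread

    CountingHandler(String elementName) { this.elementName = elementName; }

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes attributes) {
        if (elementName.equals(qName)) count.incrementAndGet();
    }

    long getCount() { return count.get(); }

    /** Parses the source while a background thread prints the running count. */
    static long parseAndReport(InputSource source, String elementName, long periodMillis)
            throws Exception {
        CountingHandler handler = new CountingHandler(elementName);
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(
                () -> System.out.println(handler.getCount() + " <" + elementName + "> elements so far"),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } finally {
            reporter.shutdownNow();   // stop reporting once parsing ends or fails
        }
        return handler.getCount();
    }
}
```

If the total is known, the reporter can print count / total instead of the raw count.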

My 2 cents

ewernli answered Oct 15 '22 02:10