Parsing XML in Java using SAX: value cut in two halves

I am trying to read an XML-based file format called mzXML using SAX in Java. It carries partially encoded mass spectrometric data (signals with intensities).

This is what the entry of interest looks like (there is more information around that):

    <peaks ... >eJwBgAN//EByACzkZJkHP/NlAceAXLJAckeQ4CIUJz/203q2...</peaks>

A complete file that triggers the error in my case can be downloaded here.

The string in one of these entries holds about 500 compressed and base64-encoded pairs of doubles (signals and intensities). I decompress and decode it to get the values (decoding not shown in the example below). That all works fine on a small dataset. Now I used a bigger one and ran into a problem that I don't understand:

The method characters(ch, start, length) does not read the complete entry in the line shown above. The length value seems to be too small.

I did not see the problem when I just printed the peaks entry to the console, because there are so many letters that I didn't notice some were missing. But the decompression fails when information is missing. When I run the program repeatedly, it always breaks the same line at the same point without throwing any exception. If I change the mzXML file, e.g. by deleting a scan, it breaks at a different position. I found this out by setting breakpoints in the characters() method and looking at the content of currentValue.

Here is the code needed to reproduce the problem:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

import javax.xml.bind.DatatypeConverter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXMLFile {

    public static byte[] decompress(byte[] data) throws IOException, DataFormatException { 
        Inflater inflater = new Inflater();  
        inflater.setInput(data); 

        ByteArrayOutputStream outputStream = new ByteArrayOutputStream(data.length); 
        byte[] buffer = new byte[data.length*2]; 
        while (!inflater.finished()) { 
            int count = inflater.inflate(buffer); 
            outputStream.write(buffer, 0, count); 
        } 
        outputStream.close(); 
        byte[] output = outputStream.toByteArray(); 

        return output; 
    } 

    public static void main(String args[]) {

        try {

            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {

                boolean peaks = false;

                public void startElement(String uri, String localName,String qName, 
                        Attributes attributes) throws SAXException {

                    if (qName.equalsIgnoreCase("PEAKS")) {
                        peaks = true;
                    }
                }

                public void endElement(String uri, String localName,
                        String qName) throws SAXException {
                    if (peaks) {peaks = false;}
                }

                public void characters(char ch[], int start, int length) throws SAXException {

                    if (peaks) {
                        String currentValue = new String(ch, start, length);
                        System.out.println(currentValue);
                        try {
                            byte[] array = decompress(DatatypeConverter.parseBase64Binary(currentValue));
                            System.out.println(array[1]);

                        } catch (IOException | DataFormatException e) {e.printStackTrace();}
                        peaks = false;
                    }
                }
            };

            saxParser.parse("file1_zlib.mzxml", handler);

        } catch (Exception e) {e.printStackTrace();}
    }

}

Is there a safer way to read large XML files? Can you tell me where the error comes from or how to avoid it?

Thanks, Michael

MichaG asked Nov 05 '13 13:11



1 Answer

The method characters(ch, start, length) does not read the complete entry in the line shown above. The length value seems to be too small.

That is precisely the way it is designed to work. From the documentation of ContentHandler:

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks.
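
To see the splitting for yourself, a small sketch like the following records the length of every characters() call for one long text node. ChunkDemo and chunkLengths are hypothetical names made up for this illustration. How many chunks you get depends on the parser implementation and its internal buffers, so no particular count should be relied on; only the total is fixed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the question's code.
class ChunkDemo {

    /** Parses the XML and returns the length of each characters() call, in order. */
    static List<Integer> chunkLengths(String xml) {
        final List<Integer> lengths = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                lengths.add(length); // one entry per chunk the parser delivers
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Build a long text node, similar in spirit to a large <peaks> entry.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            text.append('A');
        }
        List<Integer> lengths = chunkLengths("<peaks>" + text + "</peaks>");
        System.out.println("characters() calls: " + lengths.size());
        System.out.println("chunk lengths: " + lengths);
    }
}
```

Whatever the individual lengths turn out to be, they add up to the full length of the text, which is why collecting the chunks recovers the complete value.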

Therefore, you should not try calling decompress inside the characters implementation. Instead, you should append the characters that you get to an expandable buffer, and call decompress only when you get the corresponding endElement:

// Non-null only while the parser is inside a <peaks> element.
StringBuilder sb = null;

public void startElement(String uri, String localName, String qName,
    Attributes attributes) throws SAXException {
    if (qName.equalsIgnoreCase("PEAKS")) {
        sb = new StringBuilder();
    }
}

public void endElement(String uri, String localName, String qName) throws SAXException {
    if (sb == null) return;
    // All chunks have arrived by now, so the text is complete.
    try {
        byte[] array = decompress(DatatypeConverter.parseBase64Binary(sb.toString()));
        System.out.println(array[1]);
    } catch (IOException | DataFormatException e) {e.printStackTrace();}
    sb = null;
}

public void characters(char ch[], int start, int length) throws SAXException {
    if (sb == null) return;
    // May be called several times per element, so only collect the chunk here.
    sb.append(ch, start, length);
}
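
For anyone who wants to try the approach without the mzXML file, here is a self-contained sketch that builds a compressed, base64-encoded payload in memory, wraps it in a <peaks> element, and reads it back with the accumulate-then-decode pattern. Several things in it are assumptions of this sketch rather than part of the original code: the class and method names (PeaksRoundTrip, readPeaks, roundTrip) are made up, the <scan> wrapper is just a stand-in for the surrounding document, and java.util.Base64 is swapped in for DatatypeConverter because javax.xml.bind is no longer bundled with the JDK since Java 11. The decompress method also adds a guard that is not in the question: with truncated input, inflate() returns 0 and finished() never becomes true, so the original loop would keep spinning without an exception.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names below are illustrative, not from the question.
class PeaksRoundTrip {

    /** zlib-compresses data so that the demo has something to decompress. */
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) out.write(buffer, 0, deflater.deflate(buffer));
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates data; throws instead of looping forever when the input is truncated. */
    static byte[] decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int count = inflater.inflate(buffer);
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported zlib stream");
                }
                out.write(buffer, 0, count);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Parses the XML and returns the decoded, decompressed content of the last <peaks>. */
    static byte[] readPeaks(String xml) {
        final byte[][] result = new byte[1][];
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder sb; // non-null only while inside <peaks>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("peaks")) sb = new StringBuilder();
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (sb != null) sb.append(ch, start, length); // may run several times
            }
            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (sb == null) return;
                try {
                    // The MIME decoder ignores line breaks inside the base64 text.
                    result[0] = decompress(Base64.getMimeDecoder().decode(sb.toString()));
                } catch (DataFormatException e) {
                    throw new SAXException(e);
                }
                sb = null;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    /** Wraps the payload in a <peaks> element, then reads it back through SAX. */
    static byte[] roundTrip(byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(compress(payload));
        return readPeaks("<scan><peaks>" + encoded + "</peaks></scan>");
    }

    public static void main(String[] args) {
        byte[] payload = new byte[200_000];
        new Random(42).nextBytes(payload); // incompressible, so the base64 text stays long
        System.out.println("round trip ok: " + Arrays.equals(payload, roundTrip(payload)));
    }
}
```

Because decoding happens only in endElement, the result does not depend on how the parser happens to split the text between characters() calls.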

Sergey Kalinichenko answered Oct 28 '22 22:10