Why does SAXParser read so much before throwing events?

Scenario: I'm receiving a huge XML file over an extremely slow network, so I want to start the expensive processing as early as possible. That is why I decided to use SAXParser.

I expected to get an event as soon as a tag is finished.

The following test shows what I mean:

@Test
public void sax_parser_read_much_things_before_returning_events() throws Exception{
    String xml = "<a>"
               + "  <b>..</b>"
               + "  <c>..</c>"
                  // much more ...
               + "</a>";

    // wrapper to show what is read
    InputStream is = new InputStream() {
        InputStream is = new ByteArrayInputStream(xml.getBytes());

        @Override
        public int read() throws IOException {
            int val = is.read();
            System.out.print((char) val);
            return val;
        }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(is, new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            System.out.print("\nHandler start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.print("\nHandler end: " + qName);
        }
    });
}

I wrapped the input stream to see what is read and when the events occur.

What I expected was something like this:

<a>                    <- output from read()
Handler start: a
<b>                    <- output from read()
Handler start: b
</b>                   <- output from read()
Handler end: b
...

Sadly, the actual result was the following:

<a>  <b>..</b>  <c>..</c></a>        <- output from read()
Handler start: a
Handler start: b
Handler end: b
Handler start: c
Handler end: c
Handler end: a

Where is my mistake and how can I get the expected result?

Edit:

  • First, the parser tries to detect the document version, which causes it to scan everything. With a document version present, it does break in between (but not where I expect).
  • It is not OK that the parser "wants to" read, for example, 1000 bytes and blocks for that long, because the stream may not contain that much data at that point in time.
  • I found the buffer sizes in XMLEntityManager:
    • public static final int DEFAULT_BUFFER_SIZE = 8192;
    • public static final int DEFAULT_XMLDECL_BUFFER_SIZE = 64;
    • public static final int DEFAULT_INTERNAL_BUFFER_SIZE = 1024;
Marcel asked Oct 20 '22 00:10


1 Answer

It seems you are making wrong assumptions about how the I/O works. An XML parser, like most software, will request data in chunks, because requesting single bytes from a stream is a recipe for a performance disaster.
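To make the chunked request concrete, here is a minimal sketch of what a bulk read returns from two different sources. The class name ShortReadDemo, its method names, and the request size of 8192 are assumptions chosen for illustration (the size merely mirrors the DEFAULT_BUFFER_SIZE quoted in the question), and a PipedInputStream only stands in for a network stream here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Hypothetical demo class, not part of any parser API.
class ShortReadDemo {
    // Assumed request size, mirroring the buffer size quoted in the question.
    static final int REQUEST = 8192;

    // An in-memory stream already holds all of its data,
    // so a bulk read can satisfy the whole request at once.
    static int readFromArray() throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10_000]);
        return in.read(new byte[REQUEST], 0, REQUEST);
    }

    // A pipe (standing in for a network connection) only has the bytes
    // written so far, so the same request comes back with fewer bytes.
    static int readFromPipe() throws IOException {
        try (PipedOutputStream out = new PipedOutputStream();
             PipedInputStream in = new PipedInputStream(out)) {
            out.write(new byte[] {'<', 'a', '>'});
            return in.read(new byte[REQUEST], 0, REQUEST);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("array stream returned " + readFromArray() + " bytes");
        System.out.println("pipe returned " + readFromPipe() + " bytes");
    }
}
```

Both calls ask for the same number of bytes; the difference lies only in how much data the source has available when the request arrives.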

Requesting a chunk does not imply that the buffer must be completely filled before a read attempt returns. It's just that a ByteArrayInputStream cannot emulate the behavior of a network InputStream: all of its data is available immediately, so every bulk read fills the buffer. You can easily fix that by overriding read(byte[], int, int) so that it returns not a complete buffer but, e.g., a single byte on every request:

@Test
public void sax_parser_read_much_things_before_returning_events() throws Exception{
    final String xml = "<a>"
               + "  <b>..</b>"
               + "  <c>..</c>"
                  // much more ...
               + "</a>";

    // wrapper to show what is read
    InputStream is = new InputStream() {
        InputStream is = new ByteArrayInputStream(xml.getBytes());

        @Override
        public int read() throws IOException {
            int val = is.read();
            System.out.print((char) val);
            return val;
        }

        // Hand out at most one byte per bulk request to mimic a slow network stream.
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(is, new DefaultHandler(){
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            System.out.print("\nHandler start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            System.out.print("\nHandler end: " + qName);
        }
    });
}

This will print

<a>  
Handler start: a<b>
Handler start: b..</b>
Handler end: b  <c>
Handler start: c..</c>
Handler end: c</a>
Handler end: a?

showing how the XML parser adapts to the availability of data from the InputStream. (The trailing ? is most likely the wrapper printing the end-of-stream value -1 as a character.)
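If you would rather check this without reading interleaved console output, the same idea can be sketched as a small standalone program that records how many bytes had been handed to the parser when each event fired. TrickleDemo, TrickleInputStream, parseAndLog, and the delivered counter are hypothetical names introduced for this sketch; only the JDK's standard SAX and java.io classes are assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for illustration only.
class TrickleDemo {

    // Hands out at most one byte per bulk read and counts the bytes delivered.
    static class TrickleInputStream extends FilterInputStream {
        long delivered = 0;

        TrickleInputStream(InputStream source) {
            super(source);
        }

        @Override
        public int read() throws IOException {
            int val = in.read();
            if (val != -1) delivered++;   // do not count end-of-stream
            return val;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = in.read(b, off, Math.min(len, 1));
            if (n > 0) delivered += n;
            return n;
        }
    }

    // Parses the XML and returns one entry per SAX event in the form
    // "<start|end> <qName> @<bytes delivered so far>".
    static List<String> parseAndLog(String xml) throws Exception {
        TrickleInputStream stream = new TrickleInputStream(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> log = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(stream, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName + " @" + stream.delivered);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName + " @" + stream.delivered);
            }
        });
        return log;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder xml = new StringBuilder("<a>");
        for (int i = 0; i < 50; i++) {
            xml.append("<b>..</b>");
        }
        xml.append("</a>");
        for (String line : parseAndLog(xml.toString())) {
            System.out.println(line);
        }
    }
}
```

With the JDK's built-in parser, the byte count next to the first start event should be far smaller than the document length, and the counts should grow from event to event. The exact numbers depend on the parser implementation, so treat them as an observation rather than a guarantee.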

Holger answered Oct 22 '22 05:10