Issue Parsing XML Document using SaxParser - 2047 character limit?

Tags: java, parsing, xml

I have created a class that extends the SAX DefaultHandler class. My intent is to store the XML input in a series of objects while preserving the integrity of the original XML data. During testing, I noticed that some of the node data was being truncated, seemingly arbitrarily, on input.

For example:

Input: <temperature>-125</temperature>    Output: <sensitivity>5</sensitivity>
Input: <address>101_State</address>       Output: <address>te</address>

To further complicate things, the above errors occur "randomly" for about 1 out of every ~100 instances of the same XML tag. That is, the input XML file has roughly 100 tags that contain <temperature>-125</temperature>, but only one of them produces an output of <sensitivity>5</sensitivity>. The other tags correctly produce <sensitivity>-125</sensitivity>.

I have overridden the "characters(char[] ch, int start, int length)" method to simply grab the character content between XML tags:

public void characters(char[] ch, int start, int length)
        throws SAXException {

    value = new String(ch, start, length);

    //debug
    System.out.println("'" + value + "'" + "start: " + start + "length: " + length);
}

My println statements produce the following output for the specific temperature tag that results in erroneous output:

> '-12'start: 2045length: 3
> '5'start: 0length: 1

This tells me that the characters method is being called twice for this specific XML element. It is called once for all other XML tags. The "start" value on the second line suggests to me that the char[] ch is being reset in the middle of this XML tag, and that the characters method is then called again with the new char[].

Is anyone familiar with this issue? I wondered whether I was reaching the limit of a char[]'s capacity, but a quick search makes this unlikely. My char[] seems to be resetting at around 2047 characters.

Thanks,

LB

asked Sep 29 '09 19:09 by LB.


2 Answers

The SAX parser is not required to hand all of an element's text to the characters callback in a single call. The parser may invoke the characters() method multiple times, sending one chunk of data at a time.

The resolution is to accumulate all the data in a buffer until a call to another (non-characters) method arrives, typically endElement(). Only at that point is the text known to be complete.
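
As an illustration of the buffering approach, here is a minimal sketch. The class name AccumulatingHandler, the "tag=text" output format, and the parse helper are all made up for this example, and it only handles leaf elements with plain text (no mixed content). It uses only the standard javax.xml.parsers and org.xml.sax APIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: the name and the "tag=text" format are assumptions
// made for this sketch, not part of the original question.
class AccumulatingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // A new element begins: drop whitespace or text collected before it.
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append every chunk.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only here is the element's text guaranteed to be complete.
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            values.add(qName + "=" + text);
        }
        buffer.setLength(0);
    }

    public List<String> getValues() {
        return values;
    }

    // Convenience helper (also hypothetical): parse an XML string with this handler.
    public static List<String> parse(String xml) throws Exception {
        AccumulatingHandler handler = new AccumulatingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getValues();
    }
}
```

With this handler, it should not matter whether the parser delivers "-125" in one call or as "-12" followed by "5": both chunks end up in the same buffer before endElement() reads it. A StringBuilder is used rather than String concatenation simply to avoid creating a new String for every chunk.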

answered Sep 24 '22 06:09 by Vineet Reynolds


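I spent 2 whole days looking for the solution.

Change your characters method to this:

public void characters(char[] ch, int start, int length) throws SAXException {

  // The parser may deliver one element's text in several chunks,
  // so append to the existing value instead of overwriting it.
  if (value == null)
    value = new String(ch, start, length);
  else
    value += new String(ch, start, length);

  //debug
  System.out.println("'" + value + "'" + "start: " + start + "length: " + length);

}

Remember to set value back to null once you have used it (for example in endElement), otherwise the text of the next element gets appended to the previous one.

And it's done!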

answered Sep 21 '22 06:09 by desidigitalnomad