IndexOutOfBoundsException when processing empty CDATA with Transformer

Tags: java, xml, stax

I want to extract specific nodes from a large XML file. That works well, until a wild CDATA without any content appears.

The output:

ERROR:  ''
javax.xml.transform.TransformerException: java.lang.IndexOutOfBoundsException
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:732)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:336)
    at xml_test.XML_Test.extractXML2(XML_Test.java:698)
    at xml_test.XML_Test.main(XML_Test.java:811)
Caused by: java.lang.IndexOutOfBoundsException
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getTextCharacters(XMLStreamReaderImpl.java:1143)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleCharacters(StAXStream2SAX.java:261)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:171)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:120)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:674)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:723)
    ... 3 more
---------
java.lang.IndexOutOfBoundsException
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getTextCharacters(XMLStreamReaderImpl.java:1143)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleCharacters(StAXStream2SAX.java:261)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:171)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:120)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:674)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:723)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:336)
    at xml_test.XML_Test.extractXML2(XML_Test.java:698)
    at xml_test.XML_Test.main(XML_Test.java:811)

The code:

InputStream stream = new FileInputStream("C:\\myFile.xml");
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader reader = factory.createXMLStreamReader(stream);

TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();

String extractPath = "/root";
String path = "";

while(reader.hasNext()) {
    reader.next();

    if(reader.isStartElement()) {
        path += "/" + reader.getLocalName();

        if(path.equals(extractPath)) {
            StringWriter writer = new StringWriter();
            StAXSource src = new StAXSource(reader);
            StreamResult res = new StreamResult(writer);
            t.transform(src, res); // Exception thrown

            System.out.println(writer.toString());

            path = path.substring(0, path.lastIndexOf("/"));
        }
    }
    else if(reader.isEndElement()) {
        path = path.substring(0, path.lastIndexOf("/"));
    }
}

The XML that raises the error:

<foo><![CDATA[]]></foo>

Can I make the Transformer just ignore that? Or what would another implementation look like? I'm not able to change the input XML!

halloei asked Jan 19 '15 15:01


2 Answers

This is a bug in the Xerces implementation, see: https://issues.apache.org/jira/browse/XERCESJ-1033

The bundled parser apparently does not handle empty CDATA sections, so the only advice I can give is:

  1. Change the XML parser implementation
  2. Remove empty CDATA sections from the source files (replace "<![CDATA[]]>" with ""),
    or put whitespace in the CDATA, e.g. <![CDATA[ ]]>
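If neither option is possible, a third route might be to guard the reader itself. The stack trace points at XMLStreamReaderImpl.getTextCharacters being called for zero-length text, so the sketch below wraps the reader in a StreamReaderDelegate that skips the copy when there is nothing to copy. EmptyTextSafeReader and copy are hypothetical names made up for this example, and I am assuming the zero-length call is the only thing that fails; treat it as a starting point rather than a verified fix for every JDK version.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.util.StreamReaderDelegate;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical helper for illustration; not part of the JDK or any library.
class EmptyTextSafeReader extends StreamReaderDelegate {

    EmptyTextSafeReader(XMLStreamReader reader) {
        super(reader);
    }

    // For zero-length text there is nothing to copy, so the call that
    // throws in the question's stack trace is never reached.
    @Override
    public int getTextCharacters(int sourceStart, char[] target, int targetStart, int length)
            throws XMLStreamException {
        if (length == 0) {
            return 0;
        }
        return super.getTextCharacters(sourceStart, target, targetStart, length);
    }

    // Runs an identity transform over the XML through the guarded reader.
    static String copy(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Transformer t = TransformerFactory.newInstance().newTransformer();
            StringWriter writer = new StringWriter();
            t.transform(new StAXSource(new EmptyTextSafeReader(reader)),
                    new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copy(
                "<root><foo><![CDATA[]]></foo><foo><![CDATA[some data here]]></foo></root>"));
    }
}
```

In the question's loop, this would mean passing new StAXSource(new EmptyTextSafeReader(reader)) to the Transformer instead of wrapping the raw reader; the delegate forwards every other call unchanged, so the outer loop can keep using the original reader.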

Here is an example using another implementation.

JAXB

With JAXB you can map your XML to POJOs in a simple way.

For example, if you have the following XML file in c:\myFile.xml:

<root>
  <foo><![CDATA[]]></foo>
  <foo><![CDATA[some data here]]></foo>
</root>

You could have the following POJOs:

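import java.util.List;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Root {

  // Each <foo> child element becomes one entry in this list
  @XmlElement(name="foo")
  private List<Foo> foo;

  public List<Foo> getFooList() {
    return foo;
  }

  public void setFooList(List<Foo> fooList) {
    this.foo = fooList;
  }

}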

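import javax.xml.bind.annotation.XmlType;
import javax.xml.bind.annotation.XmlValue;

@XmlType(name = "foo")
public class Foo {

  // Text content of the <foo> element (CDATA included)
  @XmlValue
  private String content;

  @Override
  public String toString() {
    return content;
  }

}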

Then unmarshal the XML into objects with the following snippet:

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;

public class Main {

    public static void main(String[] args) {
        try {
            File file = new File("C:\\myFile.xml");
            JAXBContext jaxbContext = JAXBContext.newInstance(Root.class);

            // Unmarshal the whole document into the Root POJO
            Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
            Root root = (Root) jaxbUnmarshaller.unmarshal(file);

            for (Foo foo : root.getFooList()) {
                System.out.println(String.format("Foo content: |%s|", foo));
            }
        } catch (JAXBException e) {
            e.printStackTrace();
        }
    }

}

I tested this and it raises no error.

Carlos Verdes answered Sep 30 '22 10:09


I encountered this error with two builds of the same application: one build exhibited the error when handling an empty <![CDATA[]]>, and the other did not.

The difference turned out to be that the broken build was using Xerces (embedded in the JRE), while the working build had an extra dependency on the classpath: https://mvnrepository.com/artifact/org.codehaus.woodstox/woodstox-core-asl.
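To find out which StAX implementation a given build actually picks up, one quick check is to print the class of the factory that XMLInputFactory.newInstance() returns. This is a minimal sketch; StaxImplementationCheck and inputFactoryClassName are hypothetical names chosen for the example.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper for illustration; not part of the JDK or any library.
class StaxImplementationCheck {

    // Returns the class name of the StAX factory found on the classpath.
    static String inputFactoryClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("StAX implementation: " + inputFactoryClassName());
    }
}
```

With Woodstox on the classpath I would expect a class from the com.ctc.wstx package (matching the working stack trace); without it, the JDK's built-in factory is typically reported.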

The relevant part of the stack trace for the broken build is:

java.lang.Exception
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getTextCharacters(XMLStreamReaderImpl.java:1144)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleCharacters(StAXStream2SAX.java:242)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:152)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:101)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:679)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
        at com.sun.org.apache.xerces.internal.jaxp.validation.StAXValidatorHelper.validate(StAXValidatorHelper.java:107)
        at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:123)
        at javax.xml.validation.Validator.validate(Validator.java:124)

And for the working build:

java.lang.Exception
    at com.ctc.wstx.sr.BasicStreamReader.getTextCharacters(BasicStreamReader.java:894)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.handleCharacters(StAXStream2SAX.java:242)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.bridge(StAXStream2SAX.java:152)
    at com.sun.org.apache.xalan.internal.xsltc.trax.StAXStream2SAX.parse(StAXStream2SAX.java:101)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:679)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
    at com.sun.org.apache.xerces.internal.jaxp.validation.StAXValidatorHelper.validate(StAXValidatorHelper.java:107)
    at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:123)
    at javax.xml.validation.Validator.validate(Validator.java:124)

This Q/A helped me get "comfortable" with Woodstox: "What is the relation between fasterxml(jackson-dataformat-xml) and Woodstox?".

user7610 answered Sep 30 '22 09:09