
Problems getting XML node text in StAX XMLStreamConstants.CHARACTERS event

Tags: java, xml, stax

While reading an XML file using StAX and XMLStreamReader, I ran into a weird problem. I am not sure whether it's a bug or I am doing something wrong; I am still learning StAX.

The problem is this: in the XMLStreamConstants.CHARACTERS event, I collect the node text with XMLStreamReader.getText(). If the text contains an escaped &, < or > (or even something hidden), getText() returns only the first part of the string, e.g. ABC & XYZ returns only ABC.

Simplified Java Source:

    // Start StAX reader
    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    try {
        XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(inStream);
        int event = xmlStreamReader.getEventType();
        while (true) {
            switch (event) {
                case XMLStreamConstants.START_ELEMENT:
                    switch (xmlStreamReader.getLocalName()) {
                        case "group":
                            // Do something
                            break;
                        case "source":
                            isSource = true;
                            break;
                        case "target":
                            isTrans = true;
                            break;
                        default:
                            isSource = false;
                            isTrans = false;
                            break;
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (srcData != null) {
                        String srcTrns = xmlStreamReader.getText();
                        if (srcTrns != null) {
                            if (isSource) {
                                // Set source text
                                isSource = false;
                            } else if (isTrans) {
                                // Set target text
                                isTrans = false;
                            }
                        }
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (xmlStreamReader.getLocalName().equals("group")) {
                        // Add to return list
                    }
                    break;
            }
            if (!xmlStreamReader.hasNext()) {
                break;
            }
            event = xmlStreamReader.next();
        }
    } catch (XMLStreamException ex) {
        LOG.log(Level.WARNING, ex.getMessage(), MessageFormat.format("{0} {1}", ex.getCause(), ex.getLocation()));
    }

I am not sure what exactly I am doing wrong, or how to collect the complete text of the node.

Any suggestions or tips would be a great help as I keep learning StAX. :-)

Indigo asked Feb 13 '23 08:02


1 Answer

I have solved the problem after struggling and researching a bit.

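The problem was in reading text that contains escaped entity references. You need to set the IS_COALESCING property to true on the XMLInputFactory instance, before creating the reader:

    xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, true);

Basically this tells the parser to coalesce adjacent character data, so the whole text of a node arrives in a single CHARACTERS event. Without it, the parser is allowed to split the text at each entity reference and report the pieces as separate CHARACTERS events, which is why a single getText() call saw only the first piece. (Replacing entity references with their replacement text is controlled by a different property, IS_REPLACING_ENTITY_REFERENCES.)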
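To see the effect of the property in isolation, here is a minimal sketch. The class name CoalescingDemo, the helper readCharacterChunks and the sample XML string are made up for illustration and are not part of the code in the question; only the javax.xml.stream calls are real API. It records the text of every CHARACTERS event so the number of pieces can be compared with and without coalescing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CoalescingDemo {

    // Hypothetical helper: returns the text of each CHARACTERS event, in order.
    static List<String> readCharacterChunks(String xml, boolean coalescing)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The property must be set on the factory instance before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> chunks = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    chunks.add(reader.getText());
                }
            }
        } finally {
            // XMLStreamReader is not AutoCloseable, so close it explicitly.
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample input, assumed for illustration.
        String xml = "<source>ABC &amp; XYZ</source>";

        // How the text is split here depends on the StAX implementation in use.
        System.out.println("not coalescing: " + readCharacterChunks(xml, false));

        // With coalescing the text is delivered as one chunk.
        System.out.println("coalescing:     " + readCharacterChunks(xml, true));
    }
}
```

With coalescing enabled the list holds a single entry, ABC & XYZ. If changing the factory is not an option, appending every CHARACTERS chunk to a StringBuilder until the END_ELEMENT event, or calling XMLStreamReader.getElementText() on a text-only element, should also give the complete text.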

Indigo answered Apr 30 '23 06:04