Reading XML file returns wrong characters

Tags: java, xml, readfile

I have an XML file with thousands of tags whose text content I need to read, as in the screenshot below:

[Screenshot: XML file to read]

I am trying to read the text content of all the "word" tags using this code:

String filePath = "...";
File xmlFile = new File( filePath );

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
NodeList categoryNodes = domObject.getElementsByTagName( "category" );   // Get all the <category> nodes.

for (int s = 0; s < categoryNodes.getLength(); s++) {    //Loop on the <category> nodes.
    String categoryName = categoryNodes.item(s).getAttributes().getNamedItem( "name" ).getNodeValue(); 

    if( selectedCategoryName.equals( categoryName ) ) {  //get its words.
        NodeList wordsNodes = categoryNodes.item(s).getChildNodes();

        for( int i = 0; i < wordsNodes.getLength(); i++ ) {
            if( wordsNodes.item( i ).getNodeType() != Node.ELEMENT_NODE ) continue;
            String word = wordsNodes.item( i ).getTextContent();
            categoryWordsList.add( word );  // Some words are read wrong !!
        }

        break;
    }
}

But for some reason many words are read incorrectly. Examples:

"AMK6780KBU" is read as "9826</word"

"ASSI.ABR30326" is read as "rd>ASSI.AEP26"

"ASSI.25066" is read as "SI.4268</6"

It might be because the file is large. If I just add or remove some empty lines in the XML file, different words are read incorrectly than the ones mentioned above, which is strange!

You can download the XML file from here.

Asked by Brad, Mar 20 '13 12:03


2 Answers

Solution

See below :-)

What I tried in the process

Changing the XML version from 1.1 -> 1.0 fixed the problem for me. I'm using Java 1.6.0_33 (as @orique pointed out in the comments).

In my tests there is definitely corruption after a certain number of nodes. I narrowed it down to somewhere around ASSI.MTK69609. Removing that line and everything after it fixed the corruption of the earlier words.

The corruption is also resolved by simply changing the declaration to:

<?xml version="1.0"?>

and I saw zero corruption using the entire original source XML.
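
To check which XML version a given file declares before trying this workaround, one option is the DOM Level 3 method Document.getXmlVersion(). The sketch below is only an illustration: the class name XmlVersionCheck and the inline sample document are made up for the example, and it parses a string rather than the original file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
class XmlVersionCheck {

    /** Parses the given XML text and returns the version from its XML declaration. */
    static String declaredVersion(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        return doc.getXmlVersion();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; a real check would parse the actual file.
        String xml = "<?xml version=\"1.1\"?><categories/>";
        System.out.println("Declared XML version: " + declaredVersion(xml));
    }
}
```

If the reported version is 1.1 and nothing in the file actually needs XML 1.1 features, switching the declaration to 1.0 as described above is probably the least invasive thing to try.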

Similarly, if you leave the version at 1.1 but remove whitespace nodes from the source, the result is as expected. For example:

    <word>ASSI.MTK68490</word>
    <word>ASSI.MTK6862617</word>
<word>ASSI.MTK693115</word>
<word>ASSI.MTK69609</word>

results in the desired output and

    <word>ASSI.MTK68490</word>
    <word>ASSI.MTK6862617</word>
    <word>ASSI.MTK693115</word>
    <word>ASSI.MTK69609</word>

is corrupted.

Removing some end-of-line "nodes" also corrected the problem, for example

    <word>ASSI.MTK693115</word><word>ASSI.MTK69609</word>

So it was all pointing towards a bug, but where? Eventually it clicked: Xerces.

The version of Xerces shipped with Java 1.6 (and probably 1.7) is old, old, old and buggy (for example #6760982). In fact, I can break my test class by simply adding:

Document domObject = db.parse( xmlFile );
domObject.normalizeDocument(); // <-- causes following Exception

Exception in thread "main" java.lang.NullPointerException
    at com.sun.org.apache.xerces.internal.util.XML11Char.isXML11ValidNCName(XML11Char.java:340)

Many XML 1.1 defects have been fixed since then, so on a hunch I downloaded the latest version, Xerces2 Java 2.11.0.

Simply running with the most recent version resulted in the expected uncorrupted output.

java -classpath .;xercesImpl.jar;xml-apis.jar Foo > foo.txt
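
To confirm that the downloaded Xerces jars are actually being picked up instead of the JDK's bundled copy, it can help to print the concrete JAXP classes in use. This is a small sketch; the class name ParserInfo is made up for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic class, not part of the original answer.
class ParserInfo {

    /** Name of the DocumentBuilderFactory implementation JAXP selected. */
    static String factoryClassName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Name of the DocumentBuilder implementation that factory creates. */
    static String builderClassName() throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Factory: " + factoryClassName());
        System.out.println("Builder: " + builderClassName());
    }
}
```

Class names starting with com.sun.org.apache.xerces.internal belong to the copy bundled with the JDK, while names starting with org.apache.xerces come from a standalone Xerces jar. Note also that the ; classpath separator in the command above is the Windows form; on Linux or OS X the separator is : instead.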

Answered by andyb, Oct 11 '22 00:10


We have noticed that getTextContent() is buggy on some Windows implementations.

Our workaround is to do something like this:

            // getTextContent is buggy on some Java Windows Implementations
            if ( n.getNodeType(  ) == Node.ELEMENT_NODE ) {

                results [ i ] = (String) xPathFunction.evaluate( "./text()", n, XPathConstants.STRING );
            } else {  //Node.TEXT_NODE

                results [ i ] = n.getNodeValue(  );
            }

xPathFunction is a javax.xml.xpath.XPath. Expensive, but works reliably.

Actually, in your case I would use an XPath directly and call something like:

NodeList l = (NodeList) xPathFunction.evaluate( "/categories/category/word/text()", domObject, XPathConstants.NODESET );
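
For completeness, here is a self-contained sketch of that XPath call. The class name XPathWords and the tiny inline document are assumptions for illustration; the element names follow the XPath expression above. It only shows the shape of the call and is not a claim that it avoids the corruption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original answer.
class XPathWords {

    /** Returns the text of every <word> under /categories/category. */
    static List<String> words(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "/categories/category/word/text()", doc, XPathConstants.NODESET);

        // Each matched node is a text node, so getNodeValue() holds the word.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the same shape as the question's file.
        String xml = "<categories><category name=\"Category1\">"
                + "<word>ASSI.MTK68490</word><word>ASSI.MTK69609</word>"
                + "</category></categories>";
        System.out.println(words(xml));
    }
}
```

Because the XPath is still evaluated against a DOM built by the same DocumentBuilder, any corruption introduced while parsing would presumably show up here too.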

EDIT

Beats me! On OSX, Java 1.6.0_43, I get the same behaviour. In case there was any doubt that the DOM model is buggy in Java... The wrong values seem to appear reliably at certain intervals, which looks like some kind of byte buffer overrun. I never got an OOM error.

Here is what I have unsuccessfully tried:

  • word.getFirstChild().getNodeValue(); instead of word.getTextContent(); -> no change in behaviour
  • use an InputSource as an input into the DocumentBuilder instead of using a File
  • run an XPath ("/categories/category[@name='Category1']/word/text()") instead of looping over the nodes and manually traversing their children
  • run the same Test using Saxon as the XPath engine
  • check for "strange" characters in the XML file

I believe the DocumentBuilder is the culprit. It is a memory hog.

Your next best chance is to go for a SAX parser or any other streaming parser. Since your data model is small and very simple, the implementation should be easy. To ease implementation further, you may try XMLDog. We use a slightly modified version to parse gigabyte-size XML files successfully.
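
A minimal sketch of the SAX approach for this data model might look like the following. The class name SaxWordReader, the method readWords and the inline sample data are made up for illustration; the element and attribute names are taken from the question. The JDK's default SAX parser is also Xerces-based, so this is not guaranteed to sidestep a parser bug; it only shows how little code the streaming variant needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original answer.
class SaxWordReader {

    /** Collects the text of all <word> elements inside the named <category>. */
    static List<String> readWords(InputSource source, final String categoryName)
            throws Exception {
        final List<String> words = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inCategory;   // true while inside the selected <category>
            private StringBuilder text;   // non-null only while inside a <word>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("category".equals(qName)) {
                    inCategory = categoryName.equals(atts.getValue("name"));
                } else if (inCategory && "word".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append each chunk.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("word".equals(qName) && text != null) {
                    words.add(text.toString());
                    text = null;
                } else if ("category".equals(qName)) {
                    inCategory = false;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return words;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the same shape as the question's file.
        String xml = "<categories>"
                + "<category name=\"Category1\"><word>ASSI.MTK68490</word></category>"
                + "<category name=\"Category2\"><word>AMK6780KBU</word></category>"
                + "</categories>";
        System.out.println(readWords(new InputSource(new StringReader(xml)), "Category1"));
    }
}
```

For the real file, the InputSource would wrap a FileInputStream instead of a StringReader. Accumulating text in a StringBuilder matters because characters() can deliver one element's text in several chunks.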

If you ever find the issue, please update this post.

Answered by Bruno Grieder, Oct 11 '22 01:10