Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Encoding Issue Causing MalformedByteSequenceException in Xerces UTF8Reader

I'm encountering com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException with an XML file. I stepped through the Xerces code with a debugger and narrowed down the area where this was ocurring. I was able to determine that by removing the "smart quote" characters in the document, the document becomes parseable.

The document came with no DTD. Notepad++ pegs it as "ANSI as UTF-8". Firefox pegs it as "Western". I recall from a not-so-breathtaking lecture in college that UTF-8 was designed to be backward-compatible with single-byte encoding systems. I also see that on this chart, the byte sequence e2 80 9d is, in fact, representative of a "RIGHT DOUBLE QUOTATION MARK", but even though I can't see an encoding problem, I'm thinking there is one.

The exception message I'm getting from Xerces is Invalid byte 3 of 3-byte UTF-8 sequence. It's getting thrown from the invalidByte(3, 3, b2) call on line 435 of UTF8Reader. When I try to fully understand the logic of this method, my brain begins to melt out of my ears a little so I could be missing something, but as I mentioned above byte 3 (0x90). at least of the sequence above, is valid according to the UTF-8 table.

Here is the segment of the file where the double quote occurs shown in a hex editor:XML segment

I have tried the following:

  • Forcing the String to be loaded using UTF-8 via Charset.forName("UTF-8")
  • Adding the DTD <?xml version="1.0" encoding="UTF-8"?>
  • Opening the file in Notepad++ and encoding it as UTF-8 through its UI
  • Various combinations of the above, sometimes repeatedly

The byte indicated as invalid seems to be 63 (0x3F?) eclipse screenshot

I've also tried adding this smart quote character to a document that was previously parseable. As expected, it makes the parser throw up the same exception.

Stack trace:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)

...

Update: I still need to find a way to safely convert this to a String. I've encoded the file as UTF-8 using Notepad++. The code below successfully loads the bytes into a String (I can see read the XML in the String when when debugging in Eclipse), but now I'm getting MalformedByteSequenceException with different parameters. This time, I can post both the code and XML I'm using:

File file = new File("ccd.xml");

byte[] ccdBytes = org.apache.commons.io.FileUtils.readFileToByteArray(file);
String ccdString = new String(ccdBytes, Charset.forName("UTF-8"));

CDAUtil.load(new ByteArrayInputStream(IOUtils.toByteArray(ccdString))); //method that's doing the parsing

Stack Trace:

Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.openhealthtools.mdht.emf.runtime.resource.impl.FleXMLLoadImpl.load(FleXMLLoadImpl.java:55)
    at org.eclipse.emf.ecore.xmi.impl.XMLResourceImpl.doLoad(XMLResourceImpl.java:180)
    at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1494)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:268)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:250)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:238)

However,

CDAUtil.load(new FileInputStream(new File("ccd.xml")));

works

like image 881
kwikness Avatar asked Feb 10 '23 22:02

kwikness


2 Answers

You did not told us how you pass the file to Xerces. You can do in a different ways with different results. You can read a more detailed explanation about xml encoding issues here

I suggest you to do the following one:

  1. Open the file with notedpad++ and add <?xml version="1.0" encoding="UTF-8"?> as first line if is missing
  2. In Notepad++ do a convertion to UTF-8 (without bom) (it should be in the format menu but I'm using the italian version of notepad++ so I'm guessing about menu translation)
  3. Save the file
  4. In Java open it as an InputStream,i.e. pass to xml parser an InputStream and NOT as a Reader subclass

This should solve the problem, if you can past your code that pass the file to parser it's easier to find the problem.

The steps solve the problem because the first line in xml with the encoding declaration is considered by the parser only if you are using an InputStream i.e a stream of bytes. If you read a stream of bytes you need an encoding declaration in order to specify how to convert bytes into chars.

If you are passing the a String the first line is useless because you are passing a stream of chars and there is no need of encoding.

If you want to use a String you must read the file as a InputStream and convert to a Reader specifying a charset (some thing like InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");)

My guess is that you were getting the error because you did not specify the charset and Java picked up the one of your os (Windows-1252).

like image 194
Giovanni Avatar answered Feb 15 '23 10:02

Giovanni


I'm only able to get that error message when I have an actual UTF-8 coding error in my input file. Therefore, I assume that there is an actual error in the file somewhere that you can't find.

Here's my test code:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseAXml {
  public static void main(String argv[]) throws Exception {
    String xmlFile = argv[0];
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(xmlFile);
    System.out.println("Parsed Successfully");
  }
}

When I pass it a correct file - containing smart quotes as you have it - I get the expected Parsed Successfully message. Here's my regular test file:

$ hexdump -C tmp.xml
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  74 68 69 6e 67 3e 3c 61  |F-8"?>.<thing><a|
00000030  3e 54 68 69 73 20 e2 80  9c 71 75 6f 74 65 e2 80  |>This ...quote..|
00000040  9d 20 63 6f 75 6c 64 20  67 65 74 20 74 72 69 63  |. could get tric|
00000050  6b 79 3c 2f 61 3e 3c 2f  74 68 69 6e 67 3e 0a     |ky</a></thing>.|
0000005f

When I test a file with an error in it - made by mangling the byte at offset 0x38 - I get the exception you see:

$ java ParseAXml tmp.err.xml
Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2807)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
    at ParseAXml.main(ParseAXml.java:10)

So to help you out, I wrote a short java program that attempts to find the malformed byte:

import java.nio.*;
import java.nio.charset.*;
import java.io.*;

public class FindBadUTF8 {
  public static void main(String argv[]) throws Exception {
    String filename = argv[0];
    InputStream inStream = new FileInputStream(filename);

    CharsetDecoder d=Charset.forName("UTF-8").newDecoder();
    CharBuffer out = CharBuffer.allocate(1);
    ByteBuffer in = ByteBuffer.allocate(10);
    in.clear();
    long offset = 0L;
    while (true) {
      int read = inStream.read();
      if (read != -1) {
        in.put((byte)read);
      }
      out.clear();
      in.flip();
      CoderResult cr = d.decode(in, out, (read == -1));
      if (cr.isError()) {
        if (read != -1) {
          System.out.println("Error at offset " + offset + ": " + cr);
          return;
        } else {
          System.out.println("Error at end-of-file: " + cr);
          return;
        }
      }
      if (cr.isUnderflow()) {
        in.position(in.limit());
        in.limit(in.capacity());
      } else {
        in.clear();
      }
      if (read == -1) {
        break;
      }
      offset += 1L;
    }
    System.out.println("OK");
  }
}

That program, when run against my sample file with an error in it gives me this:

$ java FindBadUTF8 tmp.err.xml
Error at offset 56: MALFORMED[2]

And indeed, offset 56 (which is 0x38 in hex) is the byte I mangled:

$ hexdump -C tmp.err.xml
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  74 68 69 6e 67 3e 3c 61  |F-8"?>.<thing><a|
00000030  3e 54 68 69 73 20 e2 80  ff 71 75 6f 74 65 e2 80  |>This ...quote..|
00000040  9d 20 63 6f 75 6c 64 20  67 65 74 20 74 72 69 63  |. could get tric|
00000050  6b 79 3c 2f 61 3e 3c 2f  74 68 69 6e 67 3e 0a     |ky</a></thing>.|
0000005f
like image 34
Daniel Martin Avatar answered Feb 15 '23 11:02

Daniel Martin