
invalid byte 2 of 2-byte UTF-8 sequence

Tags: java, xml, encoding

I am trying to parse an XML file whose declaration is <?xml version="1.0" encoding="UTF-8"?>, but I get the error "invalid byte 2 of 2-byte UTF-8 sequence". Does anybody know what causes this?

flyingfromchina asked Mar 10 '10 22:03




3 Answers

Most commonly this happens when the input is actually ISO-8859-x (Latin-x, such as Latin-1) but the parser thinks it is getting UTF-8. Certain sequences of Latin-1 characters (for example, two consecutive characters with accents or umlauts) form something that is invalid as UTF-8: the first byte announces a multi-byte sequence, and the second byte does not have the high-order bits that a continuation byte must have.

This can easily occur when some process writes XML using Latin-1 but either omits the XML declaration (in which case the XML parser must default to UTF-8, as per the XML specification) or claims the content is UTF-8 when it isn't.
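
As an illustration of the byte-level problem described above, here is a minimal sketch. The class and method names are hypothetical, and the sample text "ÄÖ" is just one convenient pair of Latin-1 characters whose bytes (0xC4 0xD6) look like the start of a 2-byte UTF-8 sequence followed by a byte that is not a valid continuation byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the original answer.
public class Latin1AsUtf8Demo {

    /** Returns true if the given bytes decode cleanly as UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when the byte sequence is not well-formed UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00C4\u00D6"; // "ÄÖ"

        // Latin-1 encodes each character as one byte: C4 D6.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 encodes each character as two bytes: C3 84 C3 96.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Latin-1 bytes valid as UTF-8? " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8?   " + isValidUtf8(utf8));
    }
}
```

Running a check like this on the raw bytes of the XML file is one way to confirm whether the file really is UTF-8 before blaming the parser.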

StaxMan answered Oct 22 '22 17:10


Either the parser is set to UTF-8 even though the file is encoded differently, or the file is declared as UTF-8 but is not actually encoded that way.
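
If the file's real encoding is known and the file itself cannot be fixed, one possible workaround is to decode the bytes yourself and hand the parser a character stream, since a SAX InputSource that wraps a Reader is read directly and the encoding named in the XML declaration is disregarded. The sketch below assumes the real encoding is ISO-8859-1; the class name, method name, and sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, shown only to illustrate the idea.
public class ParseWithKnownEncoding {

    /** Parses XML bytes using the encoding the caller knows to be correct. */
    public static Document parse(byte[] xmlBytes, Charset actualEncoding) throws Exception {
        // Decode the bytes ourselves so the parser never has to guess.
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), actualEncoding);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // A mislabeled document: declared as UTF-8 but written as Latin-1.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>\u00C4\u00D6</name>"
                .getBytes(StandardCharsets.ISO_8859_1);

        Document doc = parse(xml, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The cleaner fix, where possible, is to correct the producer so that the declaration and the actual encoding agree.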

Ignacio Vazquez-Abrams answered Oct 22 '22 17:10


You could try changing the default character encoding used by String.getBytes() to UTF-8 by passing the VM option -Dfile.encoding=UTF-8.
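
If the XML is produced by your own code, an alternative to relying on the JVM default is to name the charset explicitly wherever a String is turned into bytes. This is a minimal sketch with hypothetical class and method names; it assumes the bytes are later written out as the XML document.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not from the original answer.
public class ExplicitCharset {

    /** Encodes text as UTF-8 regardless of the platform default charset. */
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>\u00C4</name>";

        // getBytes() with no argument uses the platform default charset,
        // which may not match the encoding named in the declaration.
        byte[] defaultBytes = xml.getBytes();
        byte[] utf8Bytes = toUtf8(xml);

        System.out.println("default charset length: " + defaultBytes.length);
        System.out.println("UTF-8 length:           " + utf8Bytes.length);
    }
}
```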

atott answered Oct 22 '22 17:10