I am trying to parse an XML file that begins with `<?xml version="1.0" encoding="UTF-8"?>`, but I ran into the error message "invalid byte 2 of 2-byte UTF-8 sequence". Does anybody know what causes this problem?
Why does a UTF-8 invalid byte sequence error happen? Ruby's default encoding since 2.0 is UTF-8. This means that Ruby will treat any string you input as a UTF-8 encoded string unless you tell it explicitly that it's encoded differently.
Explanation: This error occurs when you send text data but either the source encoding doesn't match the encoding currently set on the database, or the text stream contains binary data, such as NUL bytes, that is not allowed within a string.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8. These code points are the same as those in ASCII CCSID 367.
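The 1-to-4-byte widths above can be checked directly in Java (the characters chosen here are just illustrative examples):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // 1 byte: ASCII range, U+0000..U+007F
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1
        // 2 bytes: e.g. 'é' (U+00E9) from the Latin-1 supplement
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2
        // 3 bytes: e.g. the euro sign '€' (U+20AC)
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3
        // 4 bytes: a supplementary-plane code point, e.g. U+1F600
        System.out.println(new String(Character.toChars(0x1F600))
                .getBytes(StandardCharsets.UTF_8).length);                // 4
    }
}
```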
Most commonly it's due to feeding the parser ISO-8859-x input (Latin-x, such as Latin-1) while the parser thinks it is getting UTF-8. Certain sequences of Latin-1 characters (for example, two consecutive accented or umlauted characters) form byte sequences that are invalid as UTF-8; specifically, given the first byte, the second byte has unexpected high-order bits.
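The mismatch can be reproduced with a strict UTF-8 decoder; this is a small sketch using two Latin-1 bytes as the example input:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MalformedUtf8Demo {
    public static void main(String[] args) {
        // 'é' in ISO-8859-1 is the single byte 0xE9. Read as UTF-8, 0xE9
        // announces a multi-byte sequence, so the next byte must be a
        // continuation byte of the form 10xxxxxx — but 'e' (0x65) is not.
        byte[] latin1 = "ée".getBytes(StandardCharsets.ISO_8859_1);

        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            strictUtf8.decode(ByteBuffer.wrap(latin1));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            // The same kind of mismatch an XML parser reports as
            // "Invalid byte ... of ...-byte UTF-8 sequence".
            System.out.println("malformed UTF-8: " + e);
        }
    }
}
```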
This can easily occur when some process dumps out XML using Latin-1, but either forgets to output the XML declaration (in which case the XML parser must default to UTF-8, as per the XML specs), or claims it's UTF-8 even when it isn't.
Either the parser is set for UTF-8 even though the file is encoded otherwise, or the file is declared as using UTF-8 but really isn't.
You could try changing the default character encoding used by String.getBytes() to UTF-8 via the VM option -Dfile.encoding=utf-8.
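If you know the file's real encoding, a more targeted fix than a JVM-wide flag is to decode the bytes yourself and hand the SAX parser a character stream, which overrides the (possibly wrong) declared encoding. A sketch, assuming the actual encoding is ISO-8859-1 (the inline document here stands in for the real file):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseWithExplicitEncoding {
    public static void main(String[] args) throws Exception {
        // A document whose declaration claims UTF-8 but whose bytes are
        // actually Latin-1 ('é' is the lone byte 0xE9). Parsing these raw
        // bytes as UTF-8 would fail with an invalid-byte-sequence error.
        byte[] latin1Xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>é</r>"
                .getBytes(StandardCharsets.ISO_8859_1);

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // Supplying a Reader built with the true encoding makes the parser
        // ignore the declared encoding and use our decoded characters.
        InputSource in = new InputSource(new InputStreamReader(
                new ByteArrayInputStream(latin1Xml),
                StandardCharsets.ISO_8859_1));
        parser.parse(in, new DefaultHandler());
        System.out.println("parsed OK");
    }
}
```

For a real file you would wrap a FileInputStream the same way instead of the ByteArrayInputStream used in this sketch.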