The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").
Many Java solutions for removing it are a simple one-liner:
replace("\uFEFF", "")
I don't understand why this works.
Here is my test code. I check the bytes after calling String#replace and find that EF BB BF is INDEED removed. See this code run live at IdeOne.com.
It looks like magic. Why does this work?
@Test
public void removeBom() throws Exception {
    byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
    char[] c = new char[10];
    // read() returns the number of chars decoded; use it so the unused tail of c is not appended
    int n = new InputStreamReader(new ByteArrayInputStream(b), "UTF-8").read(c);
    byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes();
    for (byte bt : bytes) { // 61 61 61, we can see EF BB BF is indeed removed
        System.out.println(bt);
    }
}
The \ufeff character is a byte order mark (BOM) and is interpreted as a zero-width non-breaking space. The BOM character causes an issue when we use an incorrect codec to decode bytes that were encoded using a different codec. If you have a string that contains a BOM character, use the str.
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.
There is no official difference between UTF-8 and BOM-ed UTF-8. A BOM-ed UTF-8 string will start with the following three bytes: EF BB BF. Those bytes, if present, must be ignored when extracting the string from the file/stream.
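In Java, one way to ignore those bytes is to drop a single leading U+FEFF after decoding, instead of replacing every occurrence in the text. This is only a minimal sketch, assuming the input is UTF-8; the class name BomStrip and the helper stripLeadingBom are hypothetical names chosen for illustration:

```java
import java.nio.charset.StandardCharsets;

class BomStrip {
    // U+FEFF is the char that a UTF-8 encoded BOM (EF BB BF) decodes to.
    private static final char BOM = (char) 0xFEFF;

    // Hypothetical helper: removes one leading BOM and leaves any later U+FEFF untouched.
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'a', 'a'};
        String decoded = new String(raw, StandardCharsets.UTF_8);
        String clean = stripLeadingBom(decoded);
        // The decoded string has one extra char (the BOM) compared to the cleaned one.
        System.out.println(decoded.length() + " -> " + clean.length());
    }
}
```

Stripping only at index 0 keeps any U+FEFF that appears later in the text, which replace("\uFEFF", "") would also remove.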
The reason is that Unicode text may start with the byte order mark (in UTF-8 it is permitted but not mandatory [1]).
From Wikipedia:
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
...
The BOM is encoded in the same scheme as the rest of the document ...
Which means this special character (\uFEFF) must also be encoded in UTF-8.
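UTF-8 can encode Unicode code points in one to four bytes.
A code point that fits in 7 bits is encoded in a single byte whose highest bit is always zero: 0xxx xxxx
All other code points are encoded in multiple bytes. The number of leading 1 bits in the first byte gives the number of bytes used for the encoding, e.g. 110x xxxx means the encoding is represented by two bytes. Continuation bytes always start with 10xx xxxx (the x bits can be used for the code points).
The code points in the range U+0000 - U+007F can be encoded with one byte.
The code points in the range U+0080 - U+07FF can be encoded with two bytes.
The code points in the range U+0800 - U+FFFF can be encoded with three bytes.
A detailed explanation is on Wikipedia.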
For the BOM we need three bytes.
hex     FE       FF
binary  11111110 11111111
encode the bits in UTF-8
pattern for three byte encoding  1110 xxxx  10xx xxxx  10xx xxxx
the bits of the code point            1111    11 1011    11 1111
result                           1110 1111  1011 1011  1011 1111
in hex                           EF         BB         BF
EF BB BF sounds already familiar. ;-)
The byte sequence EF BB BF is nothing else than the BOM encoded in UTF-8.
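The bit arithmetic can be reproduced by hand in Java. The following is only a sketch of the three-byte pattern for code points in U+0800 - U+FFFF; the class ManualUtf8 and the method encodeThreeBytes are hypothetical names, and surrogates are not handled:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ManualUtf8 {
    // Applies the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx to one code point.
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),          // 1110 + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),  // 10 + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))          // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0xFEFF);
        byte[] library = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        for (byte b : manual) {
            System.out.printf("%02X ", b);  // EF BB BF
        }
        System.out.println();
        // The hand-rolled result matches the JDK encoder for this code point.
        System.out.println(Arrays.equals(manual, library));  // true
    }
}
```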
As the byte order mark has no meaning for UTF-8, Java's UTF-8 charset does not write one when encoding, and it does not strip one when decoding.
Encoding the BOM character as UTF-8:
jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 } // EF BB BF
Hence when the file is read the byte sequence gets decoded to \uFEFF.
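The decoding direction can be checked the same way. A small sketch using the bytes from the question; the class DecodeBom and the helper decodeUtf8 are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class DecodeBom {
    // Hypothetical helper: decodes the bytes as UTF-8 without any BOM handling of its own.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61, 0x61, 0x61};
        String s = decodeUtf8(raw);
        System.out.println(s.length());                        // 4: the BOM char plus "aaa"
        System.out.println(Integer.toHexString(s.charAt(0)));  // feff
        // Removing the char and encoding again yields only the three 'a' bytes.
        byte[] out = s.replace("\uFEFF", "").getBytes(StandardCharsets.UTF_8);
        System.out.println(out.length);                        // 3
    }
}
```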
For other encodings, e.g. UTF-16, the BOM is added when encoding:
jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 } // FE FF + the encoded SPACE
[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf
Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.
InputStreamReader decodes the UTF-8 encoded byte sequence (b) into Java chars, which are UTF-16 code units, and in the process turns the UTF-8 BOM (EF BB BF) into the single char \uFEFF, the same code point that UTF-16BE serializes as FE FF. The Charset documentation describes how the UTF-16 charsets treat this character:
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.
The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
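The behavior quoted above can be observed with a short sketch. The class Utf16Boms and its hex helper are hypothetical names used only for this illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf16Boms {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "a";
        // Encoding: only the plain UTF-16 charset writes a (big-endian) BOM.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));    // FE FF 00 61
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));  // 00 61
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));  // 61 00

        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};
        // Decoding: UTF-16 consumes the BOM, UTF-16BE keeps it as a char.
        System.out.println(new String(withBom, StandardCharsets.UTF_16).length());    // 1
        System.out.println(new String(withBom, StandardCharsets.UTF_16BE).length());  // 2
    }
}
```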
See JLS 3.1 to understand why the internal encoding of String is UTF-16:
https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1
The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.
String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 for your system.
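To avoid depending on the platform default, the charset can be passed explicitly. A minimal sketch; the class DefaultCharsetDemo and the helper utf8Bytes are hypothetical names, and the first printed line varies by system:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {
    // Hypothetical helper: always encodes as UTF-8, regardless of the platform default.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "aaa";
        // The no-argument getBytes() uses this charset.
        System.out.println(Charset.defaultCharset());
        byte[] platform = s.getBytes();
        byte[] explicit = utf8Bytes(s);
        // For ASCII-only text this is true on any ASCII-compatible default charset.
        System.out.println(Arrays.equals(platform, explicit));
    }
}
```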
Summary
The sequence EF BB BF (UTF-8 BOM) becomes the char \uFEFF (FE FF, the UTF-16BE BOM) when InputStreamReader decodes the byte sequence into a String, because java.lang.String stores text as UTF-16 code units. After removing \uFEFF and calling String#getBytes(), the string is encoded into UTF-8 (the default charset for your platform), and you see your original byte sequence without a BOM.