Why can the UTF-8 BOM bytes EF BB BF be removed by replacing \uFEFF?

The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").

Many solutions for removing it in Java are just a simple one-liner:

 replace("\uFEFF", "")

I don't understand why this works.

Here is my test code. I inspect the bytes after calling String#replace and find that EF BB BF is indeed removed. See this code run live at IdeOne.com.

So magic. Why does this work?

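@Test
public void removeBom() throws Exception {
    byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
    char[] c = new char[10];
    // read() returns how many chars were decoded; keep only those so the
    // unused tail of the buffer does not end up in the string as NUL chars
    int n = new InputStreamReader(new ByteArrayInputStream(b), "UTF-8").read(c);
    // name the charset explicitly instead of relying on the platform default
    byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes("UTF-8");
    for (byte bt : bytes) { // prints 97 97 97 (hex 61 61 61): EF BB BF is indeed removed
        System.out.println(bt);
    }
}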
aaron.chu asked Jan 18 '19 03:01


2 Answers

The reason is that Unicode text can start with the byte order mark (for UTF-8 it is permitted but not mandatory [1]).

From Wikipedia:

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
...
The BOM is encoded in the same scheme as the rest of the document ...

Which means this special character (\uFEFF) must also be encoded in UTF-8.

UTF-8 can encode Unicode code points in one to four bytes.

  • code points which can be represented with 7 bits are encoded in one byte; the highest bit is always zero: 0xxx xxxx
  • all other code points are encoded in multiple bytes, depending on the number of bits needed. The number of leading 1 bits in the first byte gives the number of bytes used for the encoding, e.g. 110x xxxx means the encoding takes two bytes. Continuation bytes always start with 10, i.e. 10xx xxxx. The x bits carry the bits of the code point.

The code points in the range U+0000 - U+007F can be encoded with one byte.
The code points in the range U+0080 - U+07FF can be encoded with two bytes.
The code points in the range U+0800 - U+FFFF can be encoded with three bytes.
The code points in the range U+10000 - U+10FFFF need four bytes.
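The ranges above can be cross-checked against Java's own UTF-8 encoder. This is only a minimal sketch: the class and method names (Utf8Length, utf8Length) are made up for illustration, and surrogate code points (U+D800 - U+DFFF) are deliberately not handled.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Number of bytes UTF-8 needs for a code point, following the ranges above.
    // Hypothetical helper; it does not special-case surrogate code points.
    public static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    public static void main(String[] args) {
        // One sample per range: 'a', e with acute accent, the BOM, an emoji.
        int[] samples = {0x61, 0xE9, 0xFEFF, 0x1F600};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: expected %d, encoder produced %d%n",
                    cp, utf8Length(cp), actual);
        }
    }
}
```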

A detailed explanation is on Wikipedia.

U+FEFF lies in the range U+0800 - U+FFFF, so for the BOM we need three bytes.

hex    FE       FF
binary 11111110 11111111

encode the bits in UTF-8

pattern for three byte encoding 1110 xxxx  10xx xxxx  10xx xxxx
the bits of the code point           1111    11 1011    11 1111
result                          1110 1111  1011 1011  1011 1111
in hex                          EF         BB         BF

EF BB BF should already sound familiar. ;-)

The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.
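The same bit shuffling can be written out with shifts and masks. This is a sketch under the assumption that the input is a three-byte code point; BomBits and encodeThreeBytes are illustrative names, not part of any library.

```java
public class BomBits {
    // Encode a code point in U+0800..U+FFFF with the three-byte pattern
    // 1110xxxx 10xxxxxx 10xxxxxx. Hypothetical helper for illustration only;
    // it does not reject surrogate code points.
    public static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        for (byte b : encodeThreeBytes(0xFEFF)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

For U+FEFF the three resulting bytes are EF, BB and BF, matching the table above.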

As the byte order mark has no meaning for UTF-8, Java's UTF-8 charset does not write it when encoding and does not strip it when decoding.

Encoding the BOM character as UTF-8:

jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 }  // EF BB BF

Hence when the file is read the byte sequence gets decoded to \uFEFF.
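A small sketch of that decoding step, using the byte sequence from the question. The class DecodeBom and the helper stripLeadingBom are made-up names; the helper is one possible alternative to replace, not something the question or answer prescribes.

```java
import java.nio.charset.StandardCharsets;

public class DecodeBom {
    // Plain UTF-8 decoding; Java keeps the BOM as the first char of the result.
    public static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: drop U+FEFF only if it is the very first char.
    public static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'a', 'a'};
        String s = decode(data);
        System.out.println(s.length());                     // 4: BOM char + "aaa"
        System.out.printf("U+%04X%n", (int) s.charAt(0));   // U+FEFF
        System.out.println(stripLeadingBom(s));             // aaa
    }
}
```

Note that replace("\uFEFF", "") removes every U+FEFF in the string, while a helper like stripLeadingBom only touches the first char, which may matter if the character can legitimately occur later in the text.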

When encoding with e.g. UTF-16, the BOM is added:

jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 }  // FE FF + the encoded SPACE
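To see the difference between the UTF-16 variants side by side, something like the following can be used. Utf16Encode and hex are illustrative names; the charset constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Encode {
    // Format bytes as space-separated upper-case hex, e.g. "FE FF 00 20".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String space = " ";
        // UTF-16 writes a big-endian BOM; UTF-16BE and UTF-16LE write none.
        System.out.println("UTF-16:   " + hex(space.getBytes(StandardCharsets.UTF_16)));
        System.out.println("UTF-16BE: " + hex(space.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(space.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```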

[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf

Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.

SubOptimal answered Oct 17 '22 03:10


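InputStreamReader decodes the UTF-8 encoded byte sequence (b) into the UTF-16 chars that Java uses internally, and in the process the UTF-8 BOM (EF BB BF) becomes the single char \uFEFF, which is the same character that serves as the BOM in UTF-16 (FE FF when serialized as UTF-16BE). The Charset documentation describes how the UTF-16 charsets treat this character: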

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
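The decoding rules quoted above can be observed directly. This is a hedged sketch; Utf16Decode and decode are names chosen here for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Decode {
    // Thin wrapper so the charset used for decoding is explicit at the call site.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // FE FF (big-endian BOM) followed by 00 61 ('a')
        byte[] data = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61};

        // UTF-16 consumes the BOM to determine the byte order.
        String viaUtf16 = decode(data, StandardCharsets.UTF_16);
        // UTF-16BE keeps it as a ZERO-WIDTH NON-BREAKING SPACE char.
        String viaUtf16Be = decode(data, StandardCharsets.UTF_16BE);

        System.out.println(viaUtf16.length());    // 1
        System.out.println(viaUtf16Be.length());  // 2
        System.out.printf("U+%04X%n", (int) viaUtf16Be.charAt(0)); // U+FEFF
    }
}
```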

See JLS 3.1 to understand why the internal encoding of String is UTF-16:

https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 for your system.
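To avoid depending on the platform default, the charset can be passed explicitly. A minimal sketch; DefaultCharsetDemo and utf8 are hypothetical names. On Java 18 and later the default charset is UTF-8 unless configured otherwise, while older versions take it from the operating system settings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Encode with an explicit charset so the result is the same on every platform.
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What String#getBytes() without arguments would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        for (byte b : utf8("\uFEFFaaa".replace("\uFEFF", ""))) {
            System.out.println(b); // 97 three times
        }
    }
}
```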

Summary

The sequence EF BB BF (UTF-8 BOM) is decoded to the char \uFEFF (FE FF as a UTF-16BE BOM) when InputStreamReader decodes the byte sequence into a String, because java.lang.String stores text as UTF-16 code units. After removing that char and calling String#getBytes(), the string is encoded into UTF-8 (the default charset for your platform) and you see your original byte sequence without a BOM.

Chris Hutchinson answered Oct 17 '22 05:10