Why can the UTF-8 BOM bytes efbbbf be replaced by \ufeff?

2 Answers

The reason is that Unicode text may start with the byte order mark (in UTF-8 it is permitted but not mandatory[1]).

from Wikipedia

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
...
The BOM is encoded in the same scheme as the rest of the document ...

This means that the special character (\uFEFF) must itself be encoded in UTF-8 when it appears in a UTF-8 document.

UTF-8 can encode Unicode code points in one to four bytes.

Code points that fit in 7 bits are encoded in one byte whose highest bit is always zero: 0xxx xxxx
All other code points are encoded in multiple bytes, depending on the number of bits needed. The number of leading 1 bits in the first byte gives the number of bytes used for the encoding, e.g. 110x xxxx means the encoding takes two bytes. Continuation bytes always have the form 10xx xxxx. The x bits carry the bits of the code point.

The code points in the range U+0000 - U+007F can be encoded with one byte.
The code points in the range U+0080 - U+07FF can be encoded with two bytes.
The code points in the range U+0800 - U+FFFF can be encoded with three bytes.
The code points in the range U+10000 - U+10FFFF can be encoded with four bytes.
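The ranges above can be sketched as a small lookup. This is only an illustration: the class name Utf8Length and the method utf8Length are made up for this example, and the four-byte branch for code points above U+FFFF follows from the "one to four bytes" statement. The main method compares the result with what String.getBytes reports.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Hypothetical helper: number of bytes UTF-8 needs for one code point.
    // Surrogate code points (U+D800 - U+DFFF) are not valid on their own
    // and are not handled specially in this sketch.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;    // 0xxx xxxx
        if (codePoint <= 0x7FF) return 2;   // 110x xxxx 10xx xxxx
        if (codePoint <= 0xFFFF) return 3;  // 1110 xxxx 10xx xxxx 10xx xxxx
        return 4;                           // 1111 0xxx followed by three continuation bytes
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0xFEFF, 0x1F600 };
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: table says %d, getBytes gives %d%n",
                    cp, utf8Length(cp), actual);
        }
    }
}
```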

A detailed explanation is on Wikipedia

U+FEFF lies in the range U+0800 - U+FFFF, so for the BOM we need three bytes.

hex    FE       FF
binary 11111110 11111111

encode the bits in UTF-8

pattern for three byte encoding 1110 xxxx  10xx xxxx  10xx xxxx
the bits of the code point           1111    11 1011    11 1111
result                          1110 1111  1011 1011  1011 1111
in hex                          EF         BB         BF
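The same bit shuffling can be written out by hand. The following is a minimal sketch, not how the JDK does it internally; the class BomThreeByteEncoding and the method encodeThreeBytes are names invented for this example.

```java
public class BomThreeByteEncoding {
    // Hypothetical helper: encodes a code point from U+0800 - U+FFFF
    // with the three-byte pattern 1110 xxxx  10xx xxxx  10xx xxxx.
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        byte b1 = (byte) (0xE0 | (codePoint >> 12));         // 1110 + top 4 bits
        byte b2 = (byte) (0x80 | ((codePoint >> 6) & 0x3F)); // 10 + middle 6 bits
        byte b3 = (byte) (0x80 | (codePoint & 0x3F));        // 10 + low 6 bits
        return new byte[] { b1, b2, b3 };
    }

    public static void main(String[] args) {
        for (byte b : encodeThreeBytes(0xFEFF)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: EF BB BF
    }
}
```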

EF BB BF already looks familiar. ;-)

The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.

As the byte order mark carries no byte-order information in UTF-8, Java does not write one when encoding to UTF-8.

encoding the BOM character as UTF-8

jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 }  // EF BB BF

Hence when the file is read the byte sequence gets decoded to \uFEFF.
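The decoding direction can be checked without a file by decoding a byte array directly. This is a small sketch; the class DecodeBom and the method startsWithBom are hypothetical names, and the sample bytes are made up.

```java
import java.nio.charset.StandardCharsets;

public class DecodeBom {
    // Hypothetical helper: true if the decoded text begins with U+FEFF.
    static boolean startsWithBom(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.startsWith("\uFEFF");
    }

    public static void main(String[] args) {
        // EF BB BF followed by the ASCII text "hi"
        byte[] data = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        String decoded = new String(data, StandardCharsets.UTF_8);

        System.out.println(decoded.length());                   // 3: U+FEFF, 'h', 'i'
        System.out.println((int) decoded.charAt(0) == 0xFEFF);  // true
        System.out.println(startsWithBom(data));                // true
    }
}
```

The three BOM bytes collapse into a single char, so the decoded string is one character longer than the visible text.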

When encoding to UTF-16, for example, the BOM is added:

jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 }  // FE FF + the encoded SPACE

[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf

Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.

141

answered Oct 17 '22 03:10

SubOptimal

InputStreamReader decodes the UTF-8 encoded byte sequence (b) into Java's UTF-16 based char representation, and in the process the UTF-8 BOM becomes the character \uFEFF, which is the same character the UTF-16 charsets use as their byte-order mark. The Charset documentation describes how those charsets handle it:

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
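The quoted rules can be observed with the standard charsets. The sketch below assumes nothing beyond java.nio.charset.StandardCharsets; the class Utf16BomDemo and its hex helper are names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Hypothetical helper: formats bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Encoding: only the plain UTF-16 charset writes a byte-order mark.
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16BE))); // 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16LE))); // 20 00

        // Decoding: UTF-16 consumes the mark, UTF-16BE keeps it as a character.
        byte[] withBom = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x20 };
        System.out.println(new String(withBom, StandardCharsets.UTF_16).length());   // 1
        System.out.println(new String(withBom, StandardCharsets.UTF_16BE).length()); // 2
    }
}
```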

See JLS 3.1 to understand why the internal encoding of String is UTF-16:

https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 for your system.

Summary

The sequence EF BB BF (UTF-8 BOM) is translated to the char \uFEFF (FE FF, which is also the UTF-16BE BOM) when the byte sequence is decoded into a String using InputStreamReader, because java.lang.String represents text as UTF-16 code units. After replacing that character and calling String#getBytes(), the string is encoded into UTF-8 (the default charset for your platform) and you see your original byte sequence without a BOM.
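The whole round trip can be reproduced in a few lines. This is a sketch under assumptions: the original question's code is not shown here, so the class BomRoundTrip, the methods readAll and stripBom, and the sample input are invented for illustration. UTF-8 is passed explicitly instead of relying on the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomRoundTrip {
    // Hypothetical helper: decodes UTF-8 bytes through an InputStreamReader.
    static String readAll(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Hypothetical helper: decode, drop U+FEFF, encode again as UTF-8.
    // Note that replace removes every U+FEFF, not only a leading one.
    static byte[] stripBom(byte[] utf8) {
        return readAll(utf8).replace("\uFEFF", "").getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a' };
        System.out.println(readAll(input).length()); // 2: U+FEFF and 'a'
        System.out.println(stripBom(input).length);  // 1: only 'a' is left
    }
}
```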

answered Oct 17 '22 05:10

Chris Hutchinson


Tags:

java

byte-order-mark

aaron.chu
