Unicode BOM for UTF-16LE vs UTF-32LE

Question

There seems to be an ambiguity between the Byte Order Marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:

FF FE 00 00 00 00 00 00

How can I tell if this file contains:

The UTF-16LE BOM (FF FE) followed by 3 null characters; or
The UTF-32LE BOM (FF FE 00 00) followed by one null character?

Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?

Mark Byers · Accepted Answer

As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.
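To make this concrete, here is a small Python sketch (an illustration, not part of the original answer) that decodes the 8 bytes from the question under each assumed encoding. It relies on Python's built-in "utf-16" and "utf-32" codecs, which read the BOM to pick the byte order and then strip it. Both decodes succeed, so the bytes alone do not settle which encoding was meant.

```python
# The 8 bytes from the question.
data = bytes.fromhex("FF FE 00 00 00 00 00 00")

# Assume UTF-16: the 2-byte BOM FF FE selects little-endian,
# and the remaining 6 bytes decode to three U+0000 characters.
as_utf16 = data.decode("utf-16")

# Assume UTF-32: the 4-byte BOM FF FE 00 00 selects little-endian,
# and the remaining 4 bytes decode to one U+0000 character.
as_utf32 = data.decode("utf-32")

print(len(as_utf16), len(as_utf32))
```

In both cases the encoding was chosen by the caller; the BOM only told the decoder which byte order to use.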

A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.
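If you do have to guess, a BOM sniffer along the following lines is a common heuristic. This is only a sketch: the function name guess_encoding_from_bom is hypothetical, and the choice to test the longer UTF-32 BOMs before the UTF-16 ones is an assumption, not something the Unicode FAQ mandates. With that ordering, the bytes from the question are reported as UTF-32LE, which is wrong for a UTF-16LE file whose text starts with a null character.

```python
import codecs

# Longer BOMs are listed first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding_from_bom(data: bytes):
    """Return a best-guess encoding name from a leading BOM, or None.

    This is a heuristic only: it cannot distinguish a UTF-32LE BOM from
    a UTF-16LE BOM followed by U+0000.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(guess_encoding_from_bom(bytes.fromhex("FF FE 00 00 00 00 00 00")))
print(guess_encoding_from_bom(bytes.fromhex("FF FE 41 00")))
```

Text that legitimately begins with U+0000 is rare in practice, which is why this guess usually works, but out-of-band encoding information remains the reliable option.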

Tags:

character-encoding

unicode

byte-order-mark

utf-16

file-type

Edward Loper

1 Answer

Mark Byers