
List of BOM characters

Is there a list of possible BOM characters that are used? So far I have encountered:

\x00\x00\xfe\xff    UTF-32, big-endian
\xff\xfe\x00\x00    UTF-32, little-endian
\xfe\xff            UTF-16, big-endian
\xff\xfe            UTF-16, little-endian
\xef\xbb\xbf        UTF-8

Are there any additional ones that I'm missing?


1 Answer

Short answer: no, you've covered them.
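As a rough illustration, the five signatures from the question can be checked with a small lookup table. This is a minimal sketch, not a complete encoding detector: the names `BOMS` and `detect_bom` are made up for this example, and the labels are simply the ones used in the question. The table is ordered longest first because the UTF-32 little-endian signature begins with the same two bytes as the UTF-16 little-endian one.

```python
# BOM signatures from the question, longest first so that the UTF-32LE
# signature (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
# BOMS and detect_bom are hypothetical names used only in this sketch.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32, big-endian"),
    (b"\xff\xfe\x00\x00", "UTF-32, little-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16, big-endian"),
    (b"\xff\xfe", "UTF-16, little-endian"),
]


def detect_bom(data: bytes):
    """Return (label, bom_length) for a leading BOM, or (None, 0) if none matches."""
    for signature, label in BOMS:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0
```

Note that FF FE 00 00 is inherently ambiguous: it could also be a UTF-16 little-endian BOM followed by U+0000, so a check like this is only a heuristic.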

According to the Unicode spec, UTF-8, UTF-16, and UTF-32 are the three encoding forms. At the byte level, the spec lists UTF-16, UTF-16LE, and UTF-16BE as separate encoding schemes, and similarly UTF-32, UTF-32LE, and UTF-32BE.

It's important to know that if the character stream is explicitly labeled as one of the LE or BE forms, a leading U+FEFF is not a byte order mark: you must interpret it as the character U+FEFF Zero Width No-Break Space. I.e.

UTF-16BE  initial FE FF is treated as U+FEFF
UTF-16LE  initial FF FE is treated as U+FEFF
UTF-32BE  initial 00 00 FE FF is treated as U+FEFF
UTF-32LE  initial FF FE 00 00 is treated as U+FEFF
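Python's built-in codecs happen to follow this distinction, which makes it easy to see. The snippet below is only an illustration with made-up sample bytes: the unmarked `utf-16` and `utf-32` codecs consume the leading signature as a BOM, while the explicit `-le` and `-be` codecs keep it as the character U+FEFF.

```python
# Sample bytes (made up for this example): a signature followed by the letter "A".
utf16_data = b"\xff\xfeA\x00"                  # FF FE, then "A" in UTF-16LE
utf32_data = b"\x00\x00\xfe\xff\x00\x00\x00A"  # 00 00 FE FF, then "A" in UTF-32BE

# Unmarked scheme: the leading bytes are consumed as a byte order mark.
as_utf16 = utf16_data.decode("utf-16")
as_utf32 = utf32_data.decode("utf-32")

# Explicit LE/BE scheme: the same leading bytes decode to U+FEFF as a character.
as_utf16le = utf16_data.decode("utf-16-le")
as_utf32be = utf32_data.decode("utf-32-be")

print(repr(as_utf16), repr(as_utf16le))
print(repr(as_utf32), repr(as_utf32be))
```

So if you already know the stream is UTF-16LE (for example from a protocol header) and still want to drop a leading U+FEFF, you have to strip it yourself after decoding.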

See http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G2212 for more details.

J Quinn answered Nov 20 '25 17:11