
Is the byte order mark really a valid identifier character?

C++11 makes numerous additions to the list of Unicode code points allowed in identifiers (§E). This includes the byte order mark, which falls within the allowed range FE47-FFFD.

Consulting a character browser shows that this range covers a whole bunch of random stuff: it begins between WHITE SESAME DOT and PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET, and takes in some "small" punctuation, fancy Arabic forms, the BOM itself, the halfwidth and fullwidth Asian characters, and finally the REPLACEMENT CHARACTER, which is usually used to indicate broken text rendering.

Surely this is some kind of error. They felt the need to exclude "sesame dots," whatever those are, but the byte order mark, a.k.a. the deprecated zero-width non-breaking space, is fair game? Especially when there's another zero-width non-breaking space, a.k.a. the word joiner, which was also made acceptable in identifiers by C++11?

It seems the most elegant interpretation of the Standard, for an implementation that defines some form of Unicode as its source character set, is to begin the file after an optional BOM. But it's also possible for the user to legitimately begin the file with a BOM used as an identifier. It's just ugly.

Am I missing something, or is this a no-brainer defect?

asked Nov 22 '11 by Potatoswatter


2 Answers

My attempt at an interpretation: The standard only lays out the rules for an abstract piece of source code.

Your compiler comes with a notion of a "source character set", which tells it how a concrete source code file is encoded. If that encoding is "UTF-16" (i.e. without the BE/LE specifier, and thus requiring a BOM), then the BOM is not part of the codepoint stream, but just of the file envelope.

Only after the file has been decoded does the codepoint stream get passed on to the compiler proper.
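
To make that concrete, here is a minimal sketch of that model (every name in it is invented, and real compilers handle this however they like): a leading BOM is consumed as part of the file envelope before any codepoints reach the compiler proper.

    #include <cstddef>
    #include <string>

    // Hypothetical front-end step: strip a leading BOM as part of the "file
    // envelope" so it never appears in the codepoint stream handed to the
    // compiler proper.
    enum class Encoding { Unknown, Utf8, Utf16BE, Utf16LE };

    Encoding strip_bom(std::string& raw)
    {
        auto byte = [&raw](std::size_t i) {
            return static_cast<unsigned char>(raw[i]);
        };
        if (raw.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF) {
            raw.erase(0, 3);          // UTF-8 BOM: EF BB BF
            return Encoding::Utf8;
        }
        if (raw.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF) {
            raw.erase(0, 2);          // UTF-16 big-endian BOM: FE FF
            return Encoding::Utf16BE;
        }
        if (raw.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE) {
            raw.erase(0, 2);          // UTF-16 little-endian BOM: FF FE
            return Encoding::Utf16LE;
        }
        return Encoding::Unknown;     // no BOM: fall back to the implementation default
    }

Under this model a file declared as UTF-8 that happens to start with the bytes EF BB BF loses them to the envelope, so a U+FEFF that was meant to begin an identifier never reaches the lexer, which is exactly the ambiguity the question worries about.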

answered Oct 07 '22 by Kerrek SB


First I want to say that the problem you're describing is unlikely to matter. If your compiler requires a UTF-8 BOM in order to treat a file as using the UTF-8 encoding, then you cannot have a file that lacks the UTF-8 BOM but whose source begins with U+FEFF in UTF-8 encoding. If your compiler does not require the UTF-8 BOM in order to process UTF-8 files, then you should not put UTF-8 BOMs in your source files (in the words of Michael Kaplan, "STOP USING WINDOWS NOTEPAD").

But yes, if the compiler strips BOMs then you can get behavior different from what was intended. If you want (unwisely) to begin a source file with U+FEFF but (wisely) refuse to put BOMs in your source, then you can use the universal character name: \uFEFF.
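
For example, a sketch (the identifier name is made up, and whether a particular compiler actually accepts U+FEFF in identifiers depends on how far its Annex E support goes):

    // The identifier contains U+FEFF spelled as a universal-character-name,
    // so no literal BOM byte sequence ever appears in the file.
    int \uFEFF_count = 0;

    int main()
    {
        return \uFEFF_count;   // same identifier, still no raw BOM bytes
    }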

Now onto my answer.

The retrieval of physical source file characters is not defined by the C++ standard. Declaring the source file encoding to the compiler, the file formats for storing physical source characters, and the mapping of physical source file characters to the basic source character set are all implementation-defined. Support for treating U+FEFF at the beginning of a source file as an encoding hint lies in this area.

If a compiler supports an optional UTF-8 BOM but cannot distinguish a file where the optional BOM is supplied from one where it is absent and the source code simply begins with U+FEFF, then this is a defect in the compiler design, and more broadly in the idea of the UTF-8 BOM itself.

In order to interpret bytes of data as text, the text encoding must be known, determined unambiguously by an authoritative source. (Here's an article that makes this point.) Unfortunately, back before this principle was understood, data was already being transmitted between systems, and people had to deal with data that was ostensibly text but whose encoding wasn't necessarily known. So they came up with a very bad solution: guessing. The set of techniques involving the UTF-8 BOM is one of the methods of guessing that was developed.

The UTF-8 BOM was chosen as an encoding hint for a few reasons. First, it has no effect on visible text, so it can be deliberately inserted into text without a visible effect. Second, non-UTF-8 files are very unlikely to include bytes that will be mistaken for the UTF-8 BOM. However, neither of these makes relying on a BOM anything more than guessing. There's nothing that says an ISO-8859-1 plain text file can't start with the characters U+00EF U+00BB U+00BF, for example. Encoded in ISO-8859-1, that sequence produces exactly the same bytes as U+FEFF encoded in UTF-8: 0xEF 0xBB 0xBF. Any software that relies on detecting a UTF-8 BOM will be confused by such an ISO-8859-1 file. So a BOM can't be an authoritative source, even though guessing based on it will almost always work.
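
A small sketch of that ambiguity (nothing here is compiler-specific; it just prints the three bytes both readings share):

    #include <cstdio>

    int main()
    {
        // One byte sequence, two legitimate readings:
        //   as ISO-8859-1: the three characters U+00EF U+00BB U+00BF ("ï" "»" "¿")
        //   as UTF-8:      the single code point U+FEFF (BOM / ZWNBSP)
        const unsigned char bytes[] = { 0xEF, 0xBB, 0xBF };
        for (unsigned char b : bytes)
            std::printf("%02X ", b);
        std::printf("\n");
    }

A sniffer that sees these bytes at the start of a stream has no authoritative way to tell the two cases apart; it can only guess.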

Aside from the fact that using the UTF-8 BOM amounts to guessing, there's a second reason it's a terrible idea: it rests on the mistaken assumption that changes to text which have no effect on its visual display have no effect at all. That assumption is wrong whenever text is used for something other than visual display, such as text meant to be read by a computer, as source code is.

So in conclusion: this problem with the UTF-8 BOM is not caused by the C++ specification, and unless you're absolutely forced to interact with brain-dead programs that require it (in other words, programs that can only handle the subset of Unicode strings which begin with U+FEFF), do not use the UTF-8 BOM.

answered Oct 07 '22 by bames53