
Is the byte order mark really a valid identifier character?

C++11 makes numerous additions to the list of Unicode code points allowed in identifiers (§E). This includes the byte order mark, which falls within the allowed range FE47-FFFD.

Consulting a character browser shows that this range covers a whole bunch of random stuff: it begins between WHITE SESAME DOT and PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET, and takes in some "small" punctuation, fancy Arabic forms, the BOM itself, the halfwidth and fullwidth Asian characters, and finally the REPLACEMENT CHARACTER, which is usually used to indicate broken text rendering.

Surely this is some kind of error. They felt the need to exclude "sesame dots," whatever those are, but the byte order mark, a.k.a. the deprecated zero-width non-breaking space, is fair game? Especially when there's another zero-width non-breaking space, a.k.a. the word joiner, which was also made acceptable in identifiers by C++11?

It seems the most elegant interpretation of the Standard, for an implementation that defines some form of Unicode as its source character set, is to begin the file after an optional BOM. But it's also possible for the user to legitimately begin the file with a BOM used as an identifier. It's just ugly.

Am I missing something, or is this a no-brainer defect?

asked Nov 22 '11 by Potatoswatter


2 Answers

My attempt at an interpretation: The standard only lays out the rules for an abstract piece of source code.

Your compiler comes with a notion of a "source character set", which tells it how a concrete source code file is encoded. If that encoding is "UTF-16" (i.e. without the BE/LE specifier, and thus requiring a BOM), then the BOM is not part of the codepoint stream, but just of the file envelope.

Only after the file has been decoded does the codepoint stream get passed on to the compiler proper.
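
To make that concrete, here is a minimal sketch of that model (every name in it is invented, and real compilers handle this however they like): a leading BOM is consumed as part of the file envelope before any codepoints reach the compiler proper.

    #include <cstddef>
    #include <string>

    // Hypothetical front-end step: strip a leading BOM as part of the "file
    // envelope" so it never appears in the codepoint stream handed to the
    // compiler proper.
    enum class Encoding { Unknown, Utf8, Utf16BE, Utf16LE };

    Encoding strip_bom(std::string& raw)
    {
        auto byte = [&raw](std::size_t i) {
            return static_cast<unsigned char>(raw[i]);
        };
        if (raw.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF) {
            raw.erase(0, 3);          // UTF-8 BOM: EF BB BF
            return Encoding::Utf8;
        }
        if (raw.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF) {
            raw.erase(0, 2);          // UTF-16 big-endian BOM: FE FF
            return Encoding::Utf16BE;
        }
        if (raw.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE) {
            raw.erase(0, 2);          // UTF-16 little-endian BOM: FF FE
            return Encoding::Utf16LE;
        }
        return Encoding::Unknown;     // no BOM: fall back to the implementation default
    }

Under this model a file declared as UTF-8 that happens to start with the bytes EF BB BF loses them to the envelope, so a U+FEFF that was meant to begin an identifier never reaches the lexer, which is exactly the ambiguity the question worries about.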

answered Oct 07 '22 by Kerrek SB


First I want to say that the problem you're describing is unlikely to matter. If your compiler requires a UTF-8 BOM in order to treat a file as using the UTF-8 encoding, then you cannot have a file that lacks the UTF-8 BOM but whose source begins with U+FEFF in UTF-8 encoding. If your compiler does not require the UTF-8 BOM in order to process UTF-8 files, then you should not put UTF-8 BOMs in your source files (in the words of Michael Kaplan, "STOP USING WINDOWS NOTEPAD").

But yes, if the compiler strips BOMs then you can get behavior different from what was intended. If you want (unwisely) to begin a source file with U+FEFF but (wisely) refuse to put BOMs in your source, then you can use the universal character name: \uFEFF.
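
For example, a sketch (the identifier name is made up, and whether a particular compiler actually accepts U+FEFF in identifiers depends on how far its Annex E support goes):

    // The identifier contains U+FEFF spelled as a universal-character-name,
    // so no literal BOM byte sequence ever appears in the file.
    int \uFEFF_count = 0;

    int main()
    {
        return \uFEFF_count;   // same identifier, still no raw BOM bytes
    }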

Now onto my answer.

The retrieval of physical source file characters is not defined by the C++ standard. Declaring the source file encoding to the compiler, the file formats for storing physical source characters, and the mapping of physical source file characters to the basic source character set are all implementation-defined. Support for treating U+FEFF at the beginning of a source file as an encoding hint lies in this area.

If a compiler supports an optional UTF-8 BOM but cannot distinguish a file where the optional BOM is supplied from one where it is absent and the source code simply begins with U+FEFF, then this is a defect in the compiler design, and more broadly in the idea of the UTF-8 BOM itself.

In order to interpret bytes of data as text, the text encoding must be known, determined unambiguously by an authoritative source. (Here's an article that makes this point.) Unfortunately, back before this principle was understood, data was already being transmitted between systems, and people had to deal with data that was ostensibly text but whose encoding wasn't necessarily known. So they came up with a very bad solution: guessing. The set of techniques involving the UTF-8 BOM is one of the methods of guessing that was developed.

The UTF-8 BOM was chosen as an encoding hint for a few reasons. First, it has no effect on visible text, so it can be deliberately inserted into text without a visible effect. Second, non-UTF-8 files are very unlikely to include bytes that will be mistaken for the UTF-8 BOM. However, neither of these makes relying on a BOM anything more than guessing. There's nothing that says an ISO-8859-1 plain text file can't start with the characters U+00EF U+00BB U+00BF, for example. Encoded in ISO-8859-1, that sequence produces exactly the same bytes as U+FEFF encoded in UTF-8: 0xEF 0xBB 0xBF. Any software that relies on detecting a UTF-8 BOM will be confused by such an ISO-8859-1 file. So a BOM can't be an authoritative source, even though guessing based on it will almost always work.
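
A small sketch of that ambiguity (nothing here is compiler-specific; it just prints the three bytes both readings share):

    #include <cstdio>

    int main()
    {
        // One byte sequence, two legitimate readings:
        //   as ISO-8859-1: the three characters U+00EF U+00BB U+00BF ("ï" "»" "¿")
        //   as UTF-8:      the single code point U+FEFF (BOM / ZWNBSP)
        const unsigned char bytes[] = { 0xEF, 0xBB, 0xBF };
        for (unsigned char b : bytes)
            std::printf("%02X ", b);
        std::printf("\n");
    }

A sniffer that sees these bytes at the start of a stream has no authoritative way to tell the two cases apart; it can only guess.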

Aside from the fact that using the UTF-8 BOM amounts to guessing, there's a second reason it's a terrible idea: it rests on the mistaken assumption that changes to text which have no effect on its visual display have no effect at all. That assumption is wrong whenever text is used for something other than visual display, such as text meant to be read by a computer, as source code is.

So in conclusion: this problem with the UTF-8 BOM is not caused by the C++ specification, and unless you're absolutely forced to interact with brain-dead programs that require it (in other words, programs that can only handle the subset of Unicode strings which begin with U+FEFF), do not use the UTF-8 BOM.

answered Oct 07 '22 by bames53