I'm pretty sure that Visual C++ 2015 has a bug here, but I don't feel 100% sure.
Code:
    // Encoding: UTF-8 with BOM (required by Visual C++).
    #include <stdlib.h>

    auto main()
        -> int
    {
        auto const s = L""
            "𐐷 is not in the Unicode BMP!";
        return s[0] > 256? EXIT_SUCCESS : EXIT_FAILURE;
    }
Result with g++:
    [H:\scratchpad\simple_text_io]
    > g++ --version | find "++"
    g++ (i686-win32-dwarf-rev1, Built by MinGW-W64 project) 6.2.0

    [H:\scratchpad\simple_text_io]
    > g++ compiler_bug_demo.cpp

    [H:\scratchpad\simple_text_io]
    > run a
    Process exit code = 0.

    [H:\scratchpad\simple_text_io]
    > _
Result with Visual C++:
    [H:\scratchpad\simple_text_io]
    > cl /nologo- 2>&1 | find "++"
    Microsoft (R) C/C++ Optimizing Compiler Version 19.00.23026 for x86

    [H:\scratchpad\simple_text_io]
    > cl compiler_bug_demo.cpp /Feb
    compiler_bug_demo.cpp
    compiler_bug_demo.cpp(8): warning C4566: character represented by universal-character-name '\U00010437' cannot be represented in the current code page (1252)

    [H:\scratchpad\simple_text_io]
    > run b
    Process exit code = 1.

    [H:\scratchpad\simple_text_io]
    > _
Is there any UB involved, and if not, which compiler behaves correctly?
Addendum:
The behavior is unchanged for both compilers if I use the lowercase Greek pi, “π”, which is in the BMP, so being outside the BMP doesn't seem to be what matters.
From [lex.string]:
- In translation phase 6, adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [ Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal’s initial rawness has no effect on the interpretation or well-formedness of the concatenation. —end note ] Table 8 has some examples of valid concatenations.
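The prefix rule in that paragraph can be checked at compile time. This is only a minimal sketch; the constant names are made up for illustration, and it covers only the "one literal has no encoding-prefix" case that the question's code relies on:

```cpp
#include <type_traits>

// An unprefixed piece takes on the encoding-prefix of its neighbour,
// whichever side it is on. A string literal is an lvalue, so decltype
// yields a reference to a const array (2 characters + terminator).
constexpr bool wide_then_plain =
    std::is_same<decltype(L"a" "b"), wchar_t const (&)[3]>::value;
constexpr bool plain_then_wide =
    std::is_same<decltype("a" L"b"), wchar_t const (&)[3]>::value;
constexpr bool plain_then_utf16 =
    std::is_same<decltype("a" u"b"), char16_t const (&)[3]>::value;

static_assert(wide_then_plain && plain_then_wide && plain_then_utf16,
              "concatenation keeps the single encoding-prefix");
```

This only says something about the type of the concatenated literal, not about the values of its characters.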
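So there is no UB here: concatenating a wide literal with an unprefixed one is well-defined. However, translation phase 5 runs before the concatenation and might already have changed the values of some characters ([lex.phases]):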
- Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
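If the substitution happens because the unprefixed piece is still handled as a narrow literal in phase 5 (which is what warning C4566 suggests, though I have not verified it for every Visual C++ version), then one way to sidestep the question is to give every piece the L prefix. A minimal sketch, with illustrative names and with the first character written as the universal-character-name from the warning so that the source file encoding plays no part:

```cpp
// Every piece carries the L prefix, so no piece is ever a narrow literal.
// \U00010437 is the universal-character-name of the first character.
constexpr wchar_t text[] = L"" L"\U00010437 is not in the Unicode BMP!";

// With a 16-bit wchar_t the first element is a UTF-16 lead surrogate,
// with a 32-bit wchar_t it is the code point itself; both exceed 256.
constexpr bool first_unit_above_256 = text[0] > 256;

static_assert(first_unit_above_256,
              "first wide character was replaced by a narrow substitute");
```

Here the check is done at compile time instead of through the process exit code, so a compiler that substitutes a narrow replacement character would reject the program rather than return EXIT_FAILURE.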