
Using "umlauts" in C++ code [duplicate]

Possible Duplicate:
C++ source in unicode

I just discovered this line of code in a project:

string überwachung;

I was surprised, because I thought umlauts like 'äöü' were not allowed in C++ code outside of string literals and the like, and would cause a compiler error. But this compiles just fine with Visual Studio 2008.

  • Is this a special Microsoft feature, or do other compilers allow umlauts too?
  • Are there any potential problems with that (portability, system language settings, ...)?
  • I can clearly remember this was not allowed. When did it change?

Kind regards, and thanks for any clarification.

P.S.: the tool cppcheck even marks this usage as an error, even though it compiles.

nabulke, asked Apr 12 '11 14:04

3 Answers

GCC complains about it (codepad):

error: stray '\303' in program

The C++ language standard itself limits the basic source character set to 96 characters: 91 graphical characters plus the space character, horizontal tab, vertical tab, form feed, and new-line, all of which fall within ASCII. However, there's a nice footnote:

The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

... and translation phase 1 is described as follows (emphasis mine):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined.

Generally, you shouldn't use umlauts or other special characters in your code. It may work, but if it does, you're relying on a compiler-specific feature.
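For illustration (a sketch of my own, not from the answer): keeping the identifier within the basic source character set stays portable, while the raw UTF-8 spelling is at the mercy of the implementation-defined phase-1 mapping.

#include <string>

int main() {
    // Portable: only basic source characters in the identifier.
    std::string ueberwachung;

    // Implementation-defined: raw UTF-8 in the identifier. MSVC 2008
    // accepts it, while the GCC build above rejects it with
    // "stray '\303' in program".
    // std::string überwachung;

    (void)ueberwachung;   // silence unused-variable warnings
    return 0;
}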

Alexander Gessler, answered Nov 18 '22 02:11


See section E/2 of the C++03 standard:

1 This clause lists the complete set of hexadecimal code values that are valid in universal-character-names in C++ identifiers (2.10).

Latin: 00c0–00d6, 00d8–00f6, 00f8–01f5, 01fa–0217, 0250–02a8, 1e00–1e9a, 1ea0–1ef9

This includes most accented letters.

The problem is that C++03 didn't specify UTF-8 as the input format. Even C++11 maintains compatibility with EBCDIC.

So, you can certainly create an identifier with an umlaut; the problem is getting a text editor that will interpret the universal-character-name and display it properly. Otherwise you're stuck inputting Unicode directly in hexadecimal format \uXXXX, e.g. \u00FC for ü.
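For example (a sketch of my own, not part of this answer), the identifier from the question can be spelled with a universal-character-name; whether a particular compiler accepts it varies (older GCC releases, for instance, needed -fextended-identifiers).

#include <string>

// \u00FC is the universal-character-name for 'ü'. Its code value 00FC
// lies in the 00f8-01f5 range from Annex E, so the standard allows it
// in an identifier.
std::string \u00FCberwachung;

int main() {
    \u00FCberwachung = "ok";   // same identifier, spelled the same way
    return 0;
}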

A compiler that accepts UTF-8 in string constants but not in identifiers suffers from a shortsighted implementation. Clang, at least, properly translates UTF-8 to universal-character-names in Phase 1.

Potatoswatter, answered Nov 18 '22 01:11


I believe this is the clause that applies...

2.2 Character Sets

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

So the use of the umlaut would appear to be a compiler-specific extension.

John Dibling, answered Nov 18 '22 02:11