At @Zaibis suggestion (and related to my own answer to What are the valid characters for macro names?, as well as π (and other unicode characters) in identifiers not allowed by g++))...
clang allows a lot of "crazy" characters.. although I have struggled to find much rhyme or reason - as to why some are allowed (π΄ Ο γ β β Β½), and others are not (βΆοΈ β β Β©).
For example, the following all compile A-OK (clang-700.1.76
)
#define π© ?: // OK (Pile of poo) #define οΏ @end // OK (HALFWIDTH BLACK SQUARE) #define π
Ί @interface // OK (NEGATIVE SQUARED LATIN CAPITAL LETTER K) #define οΌ° @protocol // OK (FULLWIDTH LATIN CAPITAL LETTER P)
yet the following all result in the same compiler error...
Macro name must be an identifier.
#define β TEL #define β NO #define β§ UP #define γ == #define π APPLE
clang
's docs refer to the issue, stating only...
... support for extended identifiers in C99 and C++. This feature allows identifiers to contain certain Unicode characters, as specified by the active language standard; these characters can be written directly in the source file using the UTF-8 encoding, or referred to using universal character names (\u00E0, \U000000E0).
So, I guess I'm asking.. what IS the "active language standard", and how can I find an authoritative source for what identifiers are legal.
I created the following code just to see what clang
would do with it. Out of about 63488 possible identifiers tested, 23 issued warnings and 9506 generated errors. That leaves almost 54,000 valid characters to use in identifiers. Certainly enough, but who got cut? And why?
C identifiers represent the name in the C program, for example, variables, functions, arrays, structures, unions, labels, etc. An identifier can be composed of letters such as uppercase, lowercase letters, underscore, digits, but the starting letter should be either an alphabet or an underscore.
A variable name can only have letters (both uppercase and lowercase letters), digits and underscore. The first letter of a variable should be either a letter or an underscore. There is no rule on how long a variable name (identifier) can be.
Only alphabetic characters, numeric digits, and the underscore character (_) are legal in an identifier.
A valid identifier must have characters [A-Z] or [a-z] or numbers [0-9], and underscore(_) or a dollar sign ($). for example, @javatpoint is not a valid identifier because it contains a special character which is @. There should not be any space in an identifier. For example, java tpoint is an invalid identifier.
Check whether the given string is a valid identifier. It must start with either underscore (_) or any of the characters from the ranges [βaβ, βzβ] and [βAβ, βZβ]. There must not be any white space in the string. And, all the subsequent characters after the first character must not consist of any ...
... support for extended identifiers in C99 and C++. This feature allows identifiers to contain certain Unicode characters, as specified by the active language standard; these characters can be written directly in the source file using the UTF-8 encoding, or referred to using universal character names (\u00E0, \U000000E0).
These provisions hold that every identifier may contain underscores, upper- and lower-case Latin letters, decimal digits, sequences of characters constituting "universal character names" (subject to limitations), and any other character defined by the implementation.
Authoritative source for legal identifier characters is the C11 standard ISO/IEC 9899:2011 (See Annex D). This list is based on a technical report, ISO/IEC TR 10176, but with modifications. Show activity on this post.
As others have mentioned, Annex D of ISO/IEC 9899:2011 lists the hexadecimal values of characters valid for universal character names in C11. (I won't bother repeating it here.) I have been searching for an answer as to "why" this list was chosen.
First, there are two relevant standards defining a set of characters: ISO/IEC 10646 (defining UCS) and Unicode. To further confuse (or simplify) things, they both define the same characters since the ISO and Unicode keep them synchronized. UCS is essentially just a character map associating values to a set of characters ("repertoire"), while Unicode also gives further definitions such how to compare strings in an alphabetical sorting order (collation), which code points represent "canonically equivalent" characters (normalization), and a bidirectional algorithm for how to process characters in languages written right to left, and more.
Universal character names (UCN) was a feature newly added in C99 (ISO/IEC 9899:1999). In the "Rationale for International Standard---Programming Languages---C" (Rev. 2, Oct. 1999), the purpose was "to enable the use of any 'native' character in identifiers, string literals and character constants, while retaining the portability objective of C" (sec. 5.2.1). This section continues on about issues of how to encode these characters in C (the \U
and \u
forms versus multibyte characters or native encodings) and policy models of how to deal with it (p.14, see PDF page 22).
I was hoping that the same "rationale" document from 1999 would give a reason of why each extended character range was selected as acceptable for C99's UCNs. The entirety of the rationale's Annex I is:
Annex I Universal character names for identifiers (normative)
A new feature of C9X.
This is not much of a rationale. They didn't even know what year the C standard would be published, so it's just called "C9X". A later rationale document from 2003 is slightly more enlightening:
Annex D Universal character names for identifiers (normative)
New feature for C99.
The intention is to keep current with ISO/IEC TR 10176.
ISO/IEC TR 10176 is "Guidelines for the preparation of programming language standards." It a basically a guidebook for people who write programming language standards. It includes guidelines for the use of character sets in programming languages as well as a "recommended extended repertoire for user-defined identifiers" (Annex A). But this quote from the 2003 rationale document is only an "intention to keep current," not a pledge of strict adherence to TR 10176.
There is a publicly available ISO/IEC TR 10176:2003 table of characters. The character values refer to ISO 10646. The table classifies ranges of characters from numerous languages as being "uppercase" Lu
; "lowercase" Ll
; "number, decimal digit" Nd
, "punctuation, connector" Pc
; etc. It should be clear what use such classifications have to a programming language.
An important reminder is that TR 10176 is a Technical Report, and not a standard. I have found several passing references to it on forums and in documents related to other programming languages, such as Ada, COBOL, and D language. Much of the discussion was about how closely standards of those languages should follow TR 10176 (not being a standard) and complaints that TR 10176 was lagging behind updates to ISO 10646.
Perhaps most enlightening is document WG21/N3146: "Recommendations for extended identifier characters for C and C++." It starts with a comment in 2010 to the standards body recommending restrictions on the initial characters of identifiers. It mentions similar complaints about C referencing TR 10176, and makes suggestions about what characters should be allowed as initial characters of an identifier based on restrictions from Unicode's Identifier and Pattern Syntax and XML's Common Syntactic Constructs. WG21/N3146 gives the proposed wording that later appeared in the C11 standard ISO/IEC 9899:2011. There is a table at the end of the document that helps shed light on the character ranges selected.
Below is a compiled list of ranges for extended identifier characters. The boldface ranges are those given in C11 (ISO/IEC 9899:2011 Annex D). Some comments are added about the italicized ranges not listed in C11 (i.e. not allowed). They are either marked in WG21/N3146 as disallowed by Unicode's UAX#31 or XML's Common Syntactic Constructs, or prohibited by some other comment.
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00C0-00D6, 00D8-00F6, 00F8-00FF: (Various characters, such as feminine Βͺ and masculine ΒΊ ordinal indicators, vowels with diacritics, numeric characters such as superscript numbers, fractions, etc.)
(previous gaps): All disallowed by UAX31 and/or XML. (Generally punctuation type marks like «», monetary symbols Β₯Β£, mathematical operators ΓΓ·, etc.)
0100-167F: (Latin, Greek, Cyrillic, Arabic, Thai, Ethiopic, etc.---many others)
1680: "The Ogham block contains a script-specific space: α"
1681-180D: (Ogham, Tagalog, Mongolian, etc.)
180E: "The Mongolian block contains a script-specific space"
180F-1FFF: (More languages... phonetics, extended Latin & Greek, etc.)
2000: starts the "General Punctuation" block, but some are allowed:
200Bβ200D, 202Aβ202E, 203Fβ2040, 2054, 2060β206F: (selections from "General Punctuation" block)
2070β218F: "Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols, Letterlike Symbols, Number Forms"
2190-245F: "Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition"
2460-24FF: "Enclosed Alphanumerics"
2500: starts "Box Drawing, Block Elements, Geometric Shapes", etc.
2776-2793: (some dingbats and circled dingbats)
2794-2BFF: (a different dingbat set, mathematical symbols, arrows, Braille patterns, etc.)
2C00-2DFF, 2E80-2FFF: "Glagolitic, Latin Extended-C, Coptic, Georgian Supplement, Tifinagh, Ethiopic Extended, Cyrillic Extended-A" (also CJK radical supplement)
3000: (start of "CJK Symbols and Punctuation", some selections allowed)
3004-3007, 3021-302F, 3031-303F: (allowed "CJK Symbols and Punctuation")
3040-D7FF: "Hiragana, Katakana," more CJK ideograms, radicals, etc.
D800-F8FF: (This starts the High and Low Surrogate Areas (number space needed for encodings), and Private Use)
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD: selections from "CJK Compatibility Ideographs," "Arabic Presentation Forms," etc. 10000β1FFFD, 20000β2FFFD, 30000β3FFFD, 40000β4FFFD, 50000β5FFFD, 60000β6FFFD, 70000β7FFFD, 80000β8FFFD, 90000β9FFFD, A0000βAFFFD, B0000βBFFFD, C0000βCFFFD, D0000βDFFFD, E0000βEFFFD: WG21/N3146 gives the rationale for these final ranges:
The Supplementary Private Use Area extends from F0000 through 10FFFF; both [AltId] and [XML2008] disallow characters in that range.
In addition, [AltId] disallows, as non-characters, the last two code positions of each plane, i.e. every position of the form PFFFE or PFFFF, for any value of P.
The "Ranges of characters disallowed initially" from C11 Annex D.2 are 0300β036F, 1DC0β1DFF, 20D0β20FF, FE20βFE2F.
With this WG21/N3146 placed next to the Annex D of the C11 standard, much can be inferred about how they line up. For example, mathematical operators and punctuation seem to be not allowed. I hope this sheds some light on "why" or "how" the allowed characters were chosen.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With