Are character constants always positive?

Tags:

c

language-lawyer

I'm curious if I can compile

int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };

regardless of platform or if it's better to cast character constants to unsigned char prior to using them as indices.

485

asked May 30 '19 19:05

PSkocik

2 Answers

I'm curious if I can compile
int map [] = { [ /*(unsigned char)*/ 'a' ]=1 };
regardless of platform or if it's better to cast character constants to unsigned char prior to using them as indices.

Your specific code is safe.

'a' is an integer character constant. The language specifies of these that

An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. [...] If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

(C2011, paragraph 6.4.4.4/10)

It furthermore specifies that

If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.

(C2011, paragraph 6.2.5/3)

and it requires of every implementation that both the basic source and basic execution character sets contain, among other characters, the lowercase Latin letters, including 'a'. (C2011, paragraph 5.2.1/3)

You should take care, however: an integer character constant for a character that is not a member of the basic execution character set (including a multibyte character), or a multi-character integer character constant, need not be nonnegative. Some of those could, in principle, be negative even on machines where default char is an unsigned type.

Moreover, again considering multibyte characters, the cast to unsigned char is not necessarily safe either, in that you could produce collisions that way. To be sure to avoid collisions, you would need to convert to unsigned int, but that could produce much larger arrays than you expect. If you stick to the basic character sets then you're ok. If you stick to single-byte characters then you're ok with a cast. If you must accommodate multibyte characters then for portability, you should probably choose a different approach.

answered Nov 09 '22 14:11

John Bollinger

A character constant has a nonnegative value of type int if it is based on a member of the basic execution character set.

Since a is in that basic character set, we know that 'a' is required to be positive.

On the other hand, for example, '\xFF' might not be positive. The FF value will be regarded as the bit pattern for a char†, which could be signed, giving us a -1 due to two's complement. Similar reasoning will apply if instead of a numeric escape, we use a character that corresponds to a negative value of type char, like characters corresponding to the 0x80-0xFF byte range on 8-bit systems.

It was like this in ANSI C89 and C90, where I'm relying on my memory; but the requirements persist through newer drafts and standards. In the n1570 draft, we have these items:

6.4.4.4 Character Constants, paragraph 10: "If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int."
6.2.5 Types, paragraph 3: "If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative."

A character constant is not a "char object", but the requirements in 6.4.4.4 specify that the value of a character constant is determined using the char representation: "... one that results when an object with type char whose value ...".

† The numeric escape sequences for unprefixed character constants and those prefixed with L have an associated "corresponding type", which is unsigned, and are required to be in that type's range (6.4.4.4, paragraph 9). The idea is that character values are specified as unsigned values, which give their bit-wise representation, which is then interpreted as char. This intent is also conveyed in Example 2 (6.4.4.4, paragraph 13).

163

answered Nov 09 '22 14:11

Kaz
