Where can I find a table of all the characters for every C99 Character Set?

Tags: c, character-encoding, c99

I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:

Basic Character Set
Basic Execution Character Set
Basic Source Character Set
Execution Character Set
Extended Character Set
Source Character Set

C99 mentions all six of these under section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.

The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:

52 upper- and lower-case letters in the Latin alphabet:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z

Ten decimal digits:

0 1 2 3 4 5 6 7 8 9

29 graphic characters:

! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~

4 whitespace characters:

space, horizontal tab, vertical tab, form feed

I believe these are the same as the Basic Character Set, though I'm guessing as C99 does not explicitly state this. The remaining Character Sets are a bit of a mystery to me.

Thanks for any help you can offer! :)

asked Oct 11 '10 00:10

Dave

2 Answers

Except for the basic character sets you mentioned, all of the other character sets are implementation-defined. That means they could be anything, but the implementation (that is, the C compiler, libraries, and toolchain) must document its choices. The key paragraphs here are:

§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made

§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents

§5.2.1 Character sets (paragraph 1)
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:

   -fexec-charset=charset
       Set the execution character set, used for string and character
       constants.  The default is UTF-8.  charset can be any encoding
       supported by the system's "iconv" library routine.

   -fwide-exec-charset=charset
       Set the wide execution character set, used for wide string and
       character constants.  The default is UTF-32 or UTF-16, whichever
       corresponds to the width of "wchar_t".  As with -fexec-charset,
       charset can be any encoding supported by the system's "iconv"
       library routine; however, you will have problems with encodings
       that do not fit exactly in "wchar_t".

   -finput-charset=charset
       Set the input character set, used for translation from the
       character set of the input file to the source character set used by
       GCC.  If the locale does not specify, or GCC cannot get this
       information from the locale, the default is UTF-8.  This can be
       overridden by either the locale or this command line option.
       Currently the command line option takes precedence if there's a
       conflict.  charset can be any encoding supported by the system's
       "iconv" library routine.

To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
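Since the question also asks for a way to generate such a table, here is a minimal sketch in C that prints the members listed in the question next to the values your implementation gives them in the execution character set. The names print_group and print_basic_charset_table and the group labels are my own, not anything from the standard. The list covers only the 95 members quoted in the question, so it leaves out the extra control characters of the basic execution character set.

```c
#include <stdio.h>
#include <string.h>

/* Members quoted in the question, grouped the same way. */
static const char UPPER[]   = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char LOWER[]   = "abcdefghijklmnopqrstuvwxyz";
static const char DIGITS[]  = "0123456789";
static const char GRAPHIC[] = "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";
static const char SPACING[] = " \t\v\f";

/* Prints one row per member of a group: label, character, and its
   execution-character-set value in decimal and hex.
   Returns the number of rows printed. */
int print_group(FILE *out, const char *label, const char *members)
{
    int rows = 0;
    for (const char *p = members; *p != '\0'; ++p) {
        char shown[8];
        /* Show whitespace by name so the table stays readable. */
        switch (*p) {
        case ' ':  strcpy(shown, "space"); break;
        case '\t': strcpy(shown, "\\t");   break;
        case '\v': strcpy(shown, "\\v");   break;
        case '\f': strcpy(shown, "\\f");   break;
        default:   shown[0] = *p; shown[1] = '\0'; break;
        }
        unsigned value = (unsigned char)*p;
        fprintf(out, "%-8s %-6s %3u  0x%02X\n", label, shown, value, value);
        ++rows;
    }
    return rows;
}

/* Prints the whole table and returns the total number of rows. */
int print_basic_charset_table(FILE *out)
{
    int rows = 0;
    rows += print_group(out, "upper",   UPPER);
    rows += print_group(out, "lower",   LOWER);
    rows += print_group(out, "digit",   DIGITS);
    rows += print_group(out, "graphic", GRAPHIC);
    rows += print_group(out, "space",   SPACING);
    return rows;
}
```

Calling print_basic_charset_table(stdout) from main writes one row per member. The values in the last two columns are whatever your implementation chose; the standard only fixes which characters must be present, not their values.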

answered Nov 15 '22 19:11

Adam Rosenfield

As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and the execution character set. The standard says there are two character sets it's concerned with: the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).

You have a "source character set" that consists of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.

The same goes for the execution character set: there's a basic execution character set that, combined with zero or more extended characters, makes up the extended execution character set.

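The standard (and your question) enumerates the characters that must be in the basic character set; there can be other characters in the basic set. Note that the basic execution character set must also contain the null character and control characters for alert, backspace, carriage return, and new line, which the list in your question leaves out.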

As for the difference between the basic 'range' and the extended 'range' of each character set: the values of the members of the basic character set must fit within a byte, a restriction that doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.
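The byte-size restriction on basic members can be checked at run time. The sketch below is only an illustration, and its function names are my own. Besides the byte-size check, it tests two related guarantees that I'm adding from C99 rather than from the text above: a basic member stored in a plain char is non-negative (6.2.5), and the decimal digits have consecutive values (5.2.1).

```c
#include <limits.h>

/* The members quoted in the question, as one string. */
static const char BASIC_MEMBERS[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
    " \t\v\f";

/* Returns 1 if every listed member, read through a plain char, has a
   value in the range 0..UCHAR_MAX (non-negative and fits in a byte). */
int basic_members_are_nonnegative(void)
{
    for (const char *p = BASIC_MEMBERS; *p != '\0'; ++p) {
        int value = *p;  /* plain char converted to int */
        if (value < 0 || value > UCHAR_MAX)
            return 0;
    }
    return 1;
}

/* Returns 1 if each digit after '0' is exactly one greater than the
   digit before it, which is what makes (c - '0') portable. */
int digits_are_consecutive(void)
{
    const char digits[] = "0123456789";
    for (int i = 1; i < 10; ++i) {
        if (digits[i] != digits[i - 1] + 1)
            return 0;
    }
    return 1;
}
```

On a conforming implementation both functions should return 1. No similar guarantee exists for letters, so code that assumes 'a' through 'z' are consecutive is not portable (they are not consecutive in EBCDIC).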

The values of characters in the source character set do not need to agree with the values in the execution character set (for example, the source character set might be ASCII while the execution character set might be EBCDIC).
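Because character constants take their values from the execution character set, a program can probe which family it was compiled for, whatever encoding the source file used. This is a rough sketch rather than a reliable detector: the function name and the three labels are hypothetical, and the EBCDIC probe values (A = 0xC1, a = 0x81, 0 = 0xF0) are those of the common EBCDIC code pages.

```c
/* Classifies the execution character set by probing a few character
   constants. Only three probes are used, so this is a heuristic. */
const char *guess_execution_charset(void)
{
    /* ASCII and its supersets (ISO 8859-x, UTF-8, ...) */
    if ('A' == 0x41 && 'a' == 0x61 && '0' == 0x30)
        return "ASCII-compatible";

    /* Common EBCDIC code pages */
    if ('A' == 0xC1 && 'a' == 0x81 && '0' == 0xF0)
        return "EBCDIC-like";

    return "other";
}
```

The check is done in ordinary code rather than in an #if directive because C99 does not guarantee that a character constant in a preprocessor expression has the same value it has in the execution character set.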

answered Nov 15 '22 20:11

Michael Burr
