
Where can I find a table of all the characters for every C99 Character Set?

I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:

  • Basic Character Set
  • Basic Execution Character Set
  • Basic Source Character Set
  • Execution Character Set
  • Extended Character Set
  • Source Character Set

C99 mentions all six of these under section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.

The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:

52 upper- and lower-case letters in the Latin alphabet:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z

Ten decimal digits:

0 1 2 3 4 5 6 7 8 9

29 graphic characters:

! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~

4 whitespace characters:

space, horizontal tab, vertical tab, form feed

I believe these are the same as the Basic Character Set, though I'm guessing as C99 does not explicitly state this. The remaining Character Sets are a bit of a mystery to me.

Thanks for any help you can offer! :)

Dave asked Oct 11 '10 00:10



2 Answers

Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:

§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made

§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents

§5.2.1 Character sets, paragraph 1
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:

   -fexec-charset=charset
       Set the execution character set, used for string and character
       constants.  The default is UTF-8.  charset can be any encoding
       supported by the system's "iconv" library routine.

   -fwide-exec-charset=charset
       Set the wide execution character set, used for wide string and
       character constants.  The default is UTF-32 or UTF-16, whichever
       corresponds to the width of "wchar_t".  As with -fexec-charset,
       charset can be any encoding supported by the system's "iconv"
       library routine; however, you will have problems with encodings
       that do not fit exactly in "wchar_t".

   -finput-charset=charset
       Set the input character set, used for translation from the
       character set of the input file to the source character set used by
       GCC.  If the locale does not specify, or GCC cannot get this
       information from the locale, the default is UTF-8.  This can be
       overridden by either the locale or this command line option.
       Currently the command line option takes precedence if there's a
       conflict.  charset can be any encoding supported by the system's
       "iconv" library routine.

To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
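For the basic character set, whose members are fixed by the standard, you can generate the table yourself and see which values your implementation assigns. The following is a minimal sketch, not a definitive tool: the names basic_graphic, basic_graphic_count and print_basic_charset are hypothetical, and the character list is the one quoted in the question (91 graphic characters plus the four blanks).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Graphic members of the basic source character set (C99 5.2.1p3):
   26 uppercase + 26 lowercase + 10 digits + 29 punctuation = 91. */
static const char basic_graphic[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";

/* Hypothetical helper: number of graphic characters in the table. */
size_t basic_graphic_count(void)
{
    return strlen(basic_graphic);
}

/* Hypothetical helper: print each character with the value it has in
   this implementation's execution character set.
   Returns the number of lines written. */
int print_basic_charset(FILE *out)
{
    static const struct { const char *name; char ch; } blanks[] = {
        { "space", ' ' }, { "\\t", '\t' }, { "\\v", '\v' }, { "\\f", '\f' }
    };
    int lines = 0;

    assert(out != NULL);
    for (size_t i = 0; basic_graphic[i] != '\0'; i++, lines++) {
        unsigned value = (unsigned char)basic_graphic[i];
        fprintf(out, "%-5c %3u  0x%02X\n", basic_graphic[i], value, value);
    }
    for (size_t i = 0; i < sizeof blanks / sizeof blanks[0]; i++, lines++) {
        unsigned value = (unsigned char)blanks[i].ch;
        fprintf(out, "%-5s %3u  0x%02X\n", blanks[i].name, value, value);
    }
    return lines;
}
```

Calling print_basic_charset(stdout) from main prints one line per character. The numeric columns are whatever the execution character set of your build assigns, so they may change if you compile with a different -fexec-charset.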

Adam Rosenfield answered Nov 15 '22 19:11


As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and execution character set. The standard lays out that there are two character sets it's concerned with: the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).

You have a "source character set" that is comprised of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.

The same holds for the execution character set: there's a basic execution character set that, combined with zero or more extended characters, makes up the extended execution character set.
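To make the basic/extended split concrete, here is a small sketch that tests whether a character is one of those listed for the basic source character set. The names basic_source and is_basic_source_char are hypothetical, and the list is taken from the enumeration in the question.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Characters listed for the basic source character set in C99 5.2.1p3:
   91 graphic characters followed by space, \t, \v and \f. */
static const char basic_source[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
    " \t\v\f";

/* Hypothetical helper: nonzero if c is one of the listed characters.
   Like the <ctype.h> functions, c must be representable as unsigned char. */
int is_basic_source_char(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    /* strchr would match the terminating null, so reject 0 explicitly.
       (The null character is required only in the basic execution set.) */
    if (c == 0)
        return 0;
    return strchr(basic_source, c) != NULL;
}
```

With this check, '$', '@' and '`' all come back as not basic, even though they are plain ASCII. An implementation that accepts them is treating them as extended characters.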

The standard (and your question) enumerates the characters that must be in the basic character set; any further characters an implementation supports are extended characters, not additional members of the basic set.

As for the difference between the basic 'range' and the extended 'range' of each character set, the values of the members of the basic character set must fit within a byte; that restriction doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.

The values of characters in the source character sets do not need to agree with the values in the execution character sets (for example, the source character set might be comprised of ASCII, while the execution character set might be EBCDIC).
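Because a character constant is evaluated in the execution character set, a program can inspect a few of those values to see which family it was compiled for. This is only a heuristic sketch: guess_exec_charset is a hypothetical name, and the constants compared against are the well-known ASCII and EBCDIC codes for 'A', 'a' and '0'.

```c
#include <assert.h>

/* Hypothetical helper: guess the execution character set family from the
   values the compiler gave to three character constants. */
const char *guess_exec_charset(void)
{
    /* The one ordering C99 guarantees: '0'..'9' are consecutive. */
    assert('9' - '0' == 9);

    if ('A' == 0x41 && 'a' == 0x61 && '0' == 0x30)
        return "ASCII-compatible";
    if ('A' == 0xC1 && 'a' == 0x81 && '0' == 0xF0)
        return "EBCDIC-like";
    return "unknown";
}
```

The result depends on how the program was compiled, not on how the source file was encoded, which is exactly the source/execution distinction described above.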

Michael Burr answered Nov 15 '22 20:11