I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:
C99 mentions all six of these under section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.
The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:
52 upper- and lower-case letters in the Latin alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
Ten decimal digits:
0 1 2 3 4 5 6 7 8 9
29 graphic characters:
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
4 whitespace characters:
space, horizontal tab, vertical tab, form feed
I believe these are the same as the Basic Character Set, though I'm guessing as C99 does not explicitly state this. The remaining Character Sets are a bit of a mystery to me.
Thanks for any help you can offer! :)
Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:
§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents
§5.2.1 Character sets (paragraph 1)
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:
-fexec-charset=charset
    Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's "iconv" library routine.

-fwide-exec-charset=charset
    Set the wide execution character set, used for wide string and character constants. The default is UTF-32 or UTF-16, whichever corresponds to the width of "wchar_t". As with -fexec-charset, charset can be any encoding supported by the system's "iconv" library routine; however, you will have problems with encodings that do not fit exactly in "wchar_t".

-finput-charset=charset
    Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.
To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and execution character set. The standard lays out that there are 2 character sets it's concerned with - the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).
You have a "source character set" that is comprised of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.
Similarly for the execution character set (there's a basic execution character set that, combined with zero or more extended characters, makes up the extended execution character set).
The standard (and your question) enumerates characters that must be in the basic character set - there can be other characters in the basic set.
As for the difference between the basic 'range' and the extended 'range' of each character set, the values of the members of the basic character set must fit within a byte - that restriction doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.
The values of characters in the source character sets do not need to agree with the values in the execution character sets (for example, the source character set might be comprised of ASCII, while the execution character set might be EBCDIC).