
Semantics of comparison of char objects

While I was reading through some old code today, I noticed the following assert line:

assert(('0' <= hexChar && hexChar <= '9')
    || ('A' <= hexChar && hexChar <= 'F')
    || ('a' <= hexChar && hexChar <= 'f'));

The purpose is to assert that hexChar is a hexadecimal digit ([0-9A-Fa-f]). It does this by relying on an ASCII-like ordering of char objects representing 'A', 'B', ..., 'F' and 'a', 'b', ..., 'f'.

I began wondering whether this always does what I intended, given that the execution character set is implementation-defined.

The C++ standard in Section 2.3, Character sets, mentions:

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

I interpret this to mean that ('0' <= hexChar && hexChar <= '9') is okay because '0', '1', ..., '9' are digits and each has a value one greater than the previous. However, the order of other basic source characters with respect to one another is still implementation-defined.

Is this a correct statement? Knowing nothing about the C++ compiler (so not knowing the implementation details), do I need to rewrite the assert as the following?

assert(('0' <= hexChar && hexChar <= '9')
    || ('A' == hexChar || 'B' == hexChar || 'C' == hexChar || 'D' == hexChar || 'E' == hexChar || 'F' == hexChar)
    || ('a' == hexChar || 'b' == hexChar || 'c' == hexChar || 'd' == hexChar || 'e' == hexChar || 'f' == hexChar));
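A more compact way to write the same order-independent check is to search a string of the allowed characters. This is only a sketch: the names `isHexDigitByLookup`, `kHexDigits`, and `assertHexCharByLookup` are made up for illustration, and `nullptr` assumes C++11 or later. It relies only on the characters being distinct, not on their relative order.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true if c is one of [0-9A-Fa-f].
// It makes no assumption about how the letters are ordered.
inline bool isHexDigitByLookup(char c) {
    static const char kHexDigits[] = "0123456789ABCDEFabcdef";
    // std::strchr also matches the terminating null character,
    // so '\0' has to be rejected explicitly.
    return c != '\0' && std::strchr(kHexDigits, c) != nullptr;
}

// Hypothetical wrapper mirroring the assert in the question.
inline void assertHexCharByLookup(char hexChar) {
    assert(isHexDigitByLookup(hexChar));
    (void)hexChar;  // silences unused-parameter warnings when NDEBUG is defined
}
```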
Daniel Trebbien asked Dec 19 '10 16:12



3 Answers

The first line, the comparison against '0' and '9', is 100% portable. Both C and C++ guarantee that the decimal digits have consecutive, increasing values, so it behaves identically on every implementation.
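The same guarantee makes the usual `c - '0'` conversion portable. A minimal sketch, with `decimalDigitValue` as a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper: maps '0'..'9' to 0..9.
// This works on every conforming implementation because the standard
// requires each decimal digit to be one greater than the previous one.
inline int decimalDigitValue(char c) {
    assert('0' <= c && c <= '9');  // precondition: c is a decimal digit
    return c - '0';
}
```

No comparable subtraction is guaranteed for the letters, which is why the other two lines of the assert are a separate question.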

The second and third lines are in principle implementation-defined, but there has never been, and never will be, an implementation where their behavior differs. The only non-ISO646-compatible character encoding that has ever been used with the C language (and the only reason C allows non-ISO646-compatible encodings) is EBCDIC, in which 'A' through 'F' (0xC1 to 0xC6) and 'a' through 'f' (0x81 to 0x86) each form one contiguous, ascending run. In general the letters are discontiguous in EBCDIC, with gaps after 'I' and 'R', but A-F fall within a single group.
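If you would rather have the compiler confirm this assumption than take it on faith, one option is to state it as a compile-time check next to the range test. This is a sketch assuming C++11 or later (for `static_assert` and `constexpr`); `isHexDigitByRange` and `assertHexCharByRange` are hypothetical names.

```cpp
#include <cassert>

// Compilation fails if the letters used as hex digits are not consecutive
// in this implementation's execution character set.
static_assert('B' - 'A' == 1 && 'C' - 'A' == 2 && 'D' - 'A' == 3 &&
              'E' - 'A' == 4 && 'F' - 'A' == 5,
              "'A'..'F' are not contiguous in the execution character set");
static_assert('b' - 'a' == 1 && 'c' - 'a' == 2 && 'd' - 'a' == 3 &&
              'e' - 'a' == 4 && 'f' - 'a' == 5,
              "'a'..'f' are not contiguous in the execution character set");

// Hypothetical helper: the range test from the question, now backed by
// the compile-time checks above.
constexpr bool isHexDigitByRange(char c) {
    return ('0' <= c && c <= '9')
        || ('A' <= c && c <= 'F')
        || ('a' <= c && c <= 'f');
}

// Hypothetical wrapper mirroring the original assert.
inline void assertHexCharByRange(char hexChar) {
    assert(isHexDigitByRange(hexChar));
    (void)hexChar;  // silences unused-parameter warnings when NDEBUG is defined
}
```

On ASCII and EBCDIC systems both checks pass, so this adds no run-time cost; it only documents the assumption.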

With that said, unless you need to support legacy mainframes, there is no value in trying to handle basic character encoding "portably" in C. In practice you can assume that char is 8 bits, that the values 0-127 are ASCII, and that the values 128-255 belong to a locale- or data-specific multibyte character encoding, which we'll someday be able to assume is always UTF-8.

R.. GitHub STOP HELPING ICE answered Oct 09 '22 11:10



To your first question: yes.

To your second question: perhaps, but you should probably consider using the C library function isxdigit (std::isxdigit in <cctype>) or the locale-aware C++ overload declared in <locale> instead.
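A minimal sketch of both variants follows. The wrapper names `isHexDigit`, `isHexDigitClassic`, and `assertHexCharWithLibrary` are hypothetical; the `std::isxdigit` calls are the standard ones from `<cctype>` and `<locale>`.

```cpp
#include <cassert>
#include <cctype>
#include <locale>

// Hypothetical wrapper around the C library function.
// The cast matters: passing a negative value other than EOF to
// std::isxdigit is undefined behavior, and plain char may be signed.
inline bool isHexDigit(char c) {
    return std::isxdigit(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical wrapper around the <locale> overload, pinned to the
// classic "C" locale so the result does not depend on the global locale.
inline bool isHexDigitClassic(char c) {
    return std::isxdigit(c, std::locale::classic());
}

// Hypothetical replacement for the assert in the question.
inline void assertHexCharWithLibrary(char hexChar) {
    assert(isHexDigit(hexChar));
    (void)hexChar;  // silences unused-parameter warnings when NDEBUG is defined
}
```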

CB Bailey answered Oct 09 '22 10:10



Technically, it's entirely legal for a C++ compiler to use some other character encoding. In reality, however, you almost certainly won't find a platform where this code doesn't work. This is especially true since the dominant character encodings today are Unicode-based, such as UTF-8 and UTF-16, and Unicode assigns every ASCII character the same value it has in ASCII. The only reason this is implementation-defined is to accommodate very old legacy platforms that still existed when this part of the Standard was written, and you'd have to substantially refactor your code to run on any non-ASCII platform anyway.

Puppy answered Oct 09 '22 10:10
