
Do C and C++ guarantee the ASCII values of the [a-f] and [A-F] characters?

Tags:

c++

c

ascii

I'm looking at the following code, which tests for a hexadecimal digit and converts it to an integer. The code is kind of clever in that it takes advantage of the fact that the difference between capital and lowercase letters is 32, which is bit 5. So the code performs one extra OR, but saves one JMP and two CMPs.

    static const int BIT_FIVE = (1 << 5);
    static const char str[] = "0123456789ABCDEFabcdef";

    for (unsigned int i = 0; i < COUNTOF(str); i++) {
        int digit, ch = str[i];

        if (ch >= '0' && ch <= '9')
            digit = ch - '0';
        else if ((ch |= BIT_FIVE) >= 'a' && ch <= 'f')
            digit = ch - 'a' + 10;
        ...
    }

Do C and C++ guarantee the ASCII values of the [a-f] and [A-F] characters? Here, guarantee means the uppercase and lowercase character sets will always differ by a constant value that can be represented by a single bit (for the trick above). If not, what does the standard say about them?

(Sorry for using both the C and C++ tags. I'm interested in both languages' positions on the subject.)

jww asked Apr 01 '15 00:04



1 Answer

No, it does not.

The C standard guarantees that the decimal digits and uppercase and lowercase letters exist, along with a number of other characters. It also guarantees that the decimal digits are contiguous, for example '0' + 9 == '9', and that all members of the basic execution character set have non-negative values. It specifically does not guarantee that the letters are contiguous. (For all the gory details, see the N1570 draft of the C standard, section 5.2.1; the guarantee that basic characters are non-negative is in 6.2.5p3, in the discussion of type char.)
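As a rough illustration of what those guarantees do and do not allow, here is a minimal sketch of a conversion that relies only on the contiguity of the decimal digits. The function name `hex_digit_value` and the `-1` error convention are assumptions made for this example, not part of the question's code.

```c
/* Returns the value 0..15 of a hexadecimal digit, or -1 if ch is not one.
 * Only the decimal digits are handled arithmetically, because the C
 * standard guarantees that '0'..'9' are contiguous. The letters are
 * listed case by case, so nothing is assumed about their codes.
 * (hex_digit_value is a hypothetical name chosen for this sketch.) */
int hex_digit_value(int ch)
{
    if (ch >= '0' && ch <= '9')
        return ch - '0';

    switch (ch) {
    case 'a': case 'A': return 10;
    case 'b': case 'B': return 11;
    case 'c': case 'C': return 12;
    case 'd': case 'D': return 13;
    case 'e': case 'E': return 14;
    case 'f': case 'F': return 15;
    default:            return -1;  /* not a hexadecimal digit */
    }
}
```

Calling `hex_digit_value('B')` yields 11 on any conforming implementation, whatever the execution character set. Whether a compiler turns the `switch` into something as compact as the OR trick is a separate question that this sketch does not address.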

The assumption that 'a' .. 'f' and 'A' .. 'F' have contiguous codes is almost certainly a reasonable one. In ASCII and all ASCII-based character sets, the 26 lowercase letters are contiguous, as are the 26 uppercase letters. Even in EBCDIC, the only significant rival to ASCII, where the alphabet as a whole is not contiguous, the letters 'a' .. 'f' and 'A' .. 'F' are (EBCDIC has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J', and between 'R' and 'S').

However, the assumption that setting bit 5 of the representation will convert uppercase letters to lowercase is not valid for EBCDIC. In ASCII, the codes for the lowercase and uppercase letters differ by 32; in EBCDIC they differ by 64.
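The arithmetic can be checked with the numeric codes themselves. The sketch below hard-codes the codes for 'A' and 'a' in ASCII (0x41, 0x61) and in EBCDIC (0xC1, 0x81) as plain integers, so it behaves the same on any machine. The enumerator names and `set_bit_five` are made up for this illustration, and `static_assert` assumes a C11 or later compiler.

```c
#include <assert.h>  /* static_assert (C11) */

enum {
    BIT_FIVE       = 1 << 5,  /* 0x20, the bit the trick sets */
    ASCII_UPPER_A  = 0x41,
    ASCII_LOWER_A  = 0x61,
    EBCDIC_UPPER_A = 0xC1,
    EBCDIC_LOWER_A = 0x81
};

/* The operation used by the trick: force bit 5 on. */
int set_bit_five(int code)
{
    return code | BIT_FIVE;
}

/* ASCII: 'A' | 0x20 is exactly 'a'. */
static_assert((ASCII_UPPER_A | BIT_FIVE) == ASCII_LOWER_A,
              "bit 5 maps ASCII 'A' to 'a'");

/* EBCDIC: the cases differ by 64 (bit 6), and uppercase has the larger
 * code, so setting bit 5 turns 0xC1 into 0xE1, which is not 'a'. */
static_assert(EBCDIC_UPPER_A - EBCDIC_LOWER_A == 64,
              "EBCDIC cases differ by 64");
static_assert((EBCDIC_UPPER_A | BIT_FIVE) != EBCDIC_LOWER_A,
              "bit 5 does not map EBCDIC 'A' to 'a'");
```

Because every value here is an integer constant, the assertions are evaluated at compile time; nothing depends on the character set of the compiler that builds it.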

This kind of bit-twiddling to save an instruction or two might be reasonable in code that's part of the standard library or that's known to be performance-critical. The implicit assumption of an ASCII-based character set should IMHO at least be made explicit by a comment. A 256-element static lookup table would probably be even faster at the expense of a tiny amount of extra storage.
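One possible shape for such a lookup table is sketched below. It uses C99 designated initializers indexed by character constants, so it makes no assumption about ASCII. The names `hex_table` and `hex_digit_lookup`, and the choice to store the digit value plus one (so that the default 0 means "not a hex digit"), are assumptions of this sketch. It is not a claim about how any particular library does it, and no speed measurements are implied.

```c
#include <limits.h>  /* UCHAR_MAX */

/* One entry per possible unsigned char value (256 on typical platforms).
 * Each entry holds digit value + 1; entries that are not listed default
 * to 0, which marks "not a hexadecimal digit". */
static const unsigned char hex_table[UCHAR_MAX + 1] = {
    ['0'] = 1,  ['1'] = 2,  ['2'] = 3,  ['3'] = 4,  ['4'] = 5,
    ['5'] = 6,  ['6'] = 7,  ['7'] = 8,  ['8'] = 9,  ['9'] = 10,
    ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ['f'] = 16,
    ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ['F'] = 16
};

/* Returns 0..15 for a hexadecimal digit, or -1 otherwise.
 * ch is expected to be an unsigned char value or EOF, as with <ctype.h>. */
int hex_digit_lookup(int ch)
{
    if (ch < 0 || ch > UCHAR_MAX)
        return -1;
    return (int)hex_table[ch] - 1;
}
```

When the input comes from a plain `char`, converting it first, as in `hex_digit_lookup((unsigned char)c)`, keeps negative `char` values from being rejected by the range check. Array designators like `['a'] = 11` are a C feature and are not available in standard C++, so a C++ version would need to fill the table differently.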

Keith Thompson answered Oct 17 '22 20:10
