
Is there a simple, portable way to determine the ordering of two characters in C?

According to the standard:

The values of the members of the execution character set are implementation-defined.
(ISO/IEC 9899:1999 5.2.1/1)

Further in the standard:

...the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
(ISO/IEC 9899:1999 5.2.1/3)
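
As an illustration of what this digit guarantee buys, here is a minimal sketch. The helper name digit_value is hypothetical; the only thing it assumes is the 5.2.1/3 rule quoted above, so the range test and the subtraction should hold on any conforming implementation.

```c
/* Hypothetical helper: returns 0..9 for a decimal digit character,
   or -1 if c is not one of '0'..'9'.
   Relies only on the guarantee that the digits are contiguous and
   increasing in the execution character set. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}
```

Nothing comparable can be written for letters using the standard's guarantees alone, which is the point of the question.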

It appears that the standard requires that the execution character set includes the 26 uppercase and 26 lowercase letters of the Latin alphabet, but I see no requirement that these characters be ordered in any way. I only see an order stipulation for the decimal digits.

This would seem to imply that, strictly speaking, there is no guarantee that 'a' < 'b'. Now, the letters of the alphabet are in order in each of ASCII, UTF-8, and EBCDIC. But for ASCII and UTF-8 we have 'A' < 'a', while for EBCDIC we have 'a' < 'A'.

It might be nice to have a function in ctype.h that compares alphabetic characters portably. Short of this or something similar, it seems to me that one must look in the locale to find the value of CODESET and proceed accordingly, but this doesn't seem simple.

My gut tells me that this is almost never an issue; for most cases alphabetical characters can be handled by converting to lowercase, because for the most commonly used character sets the letters are in order.
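
The "convert to lowercase and compare" approach could look like the sketch below. The function name precedes_ignoring_case is hypothetical, and the code assumes, without any backing from the standard, that the lowercase letters have increasing values (as they do in ASCII, UTF-8, and EBCDIC).

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if c1 sorts before c2 once both are
   lowercased, 0 otherwise.
   ASSUMPTION: the execution character set orders 'a'..'z' by value.
   The cast to unsigned char avoids undefined behavior when char is
   signed and holds a negative value. */
int precedes_ignoring_case(char c1, char c2)
{
    return tolower((unsigned char)c1) < tolower((unsigned char)c2);
}
```

Note that tolower depends on the current locale, so this is only a sketch of the common-case shortcut, not a portable answer.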

The question: given two chars

char c1;
char c2;

is there a simple, portable way to determine if c1 precedes c2 alphabetically? Or do we assume that the lowercase and uppercase characters always occur in sequence, even though this does not appear to be guaranteed by the standard?

To clarify any confusion, I am really just interested in the 52 letters of the Latin alphabet that are guaranteed by the standard to be in the execution character set. I realize that other sets of letters are important, but it seems that we can't even know about the ordering of this small subset of letters.

Edit

I think that I need to clarify a bit more. The issue, as I see it, is that we commonly think of the 26 lowercase letters of the Latin alphabet as being ordered. I would like to be able to assert that 'a' comes before 'b', and once 'a' and 'b' have integral values we have a convenient way of expressing this in code: 'a' < 'b'. But the standard gives no assurance that this expression evaluates as expected. Why not? The standard does guarantee this behavior for the digits 0-9, and this seems sensible. If I want to determine whether one letter-char precedes another, say for sorting purposes, and I want this code to be truly portable, it seems that the standard offers no help. Instead I have to rely on the convention, adopted by ASCII, UTF-8, EBCDIC, etc., that 'a' < 'b' is true. But this is portable only if every character set in use follows that convention, which may well be the case.

This question originated for me in another question thread: Check if a letter is before or after another letter in C. Here, a few people suggested that you could determine the order of two letters stored in chars using inequalities. But one commenter pointed out that this behavior is not guaranteed by the standard.

ad absurdum asked Oct 07 '16 19:10


2 Answers

strcoll is designed for this purpose: it compares two strings according to the LC_COLLATE category of the current locale. Simply set up two strings of one character each. (Normally you want to compare strings, not characters.)
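
A minimal sketch of that suggestion follows. The helper name char_collate is hypothetical; strcoll itself is the standard function declared in string.h.

```c
#include <string.h>

/* Hypothetical helper: wraps each character in a one-character string
   and compares the two with strcoll.
   Returns a negative value if c1 collates before c2, zero if they
   collate equally, and a positive value otherwise. */
int char_collate(char c1, char c2)
{
    char s1[2] = { c1, '\0' };
    char s2[2] = { c2, '\0' };
    return strcoll(s1, s2);
}
```

A program starts in the "C" locale, where strcoll typically orders strings the same way strcmp does, that is, by character value. To get the user's collation rules, call setlocale(LC_COLLATE, "") (declared in locale.h) once before comparing. How upper and lower case interleave then depends on the locale in effect.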

Malcolm McLean answered Nov 15 '22 19:11


There are historically used codes that don't simply order the alphabet. Baudot, for example, puts vowels before consonants, so 'A' < 'B', but 'U' < 'B' as well.

There are also codes like EBCDIC that are ordered, but with gaps. So in EBCDIC, 'I' < 'J', but 'I' + 1 != 'J'.
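
Because of such gaps and reorderings, arithmetic like c - 'A' is not a reliable alphabet index. One way around this, sketched here with hypothetical helper names (letter_index, letter_precedes), is to look the character up in a string literal that spells the alphabet in the desired order. It depends on neither the character values nor the locale.

```c
#include <string.h>

/* Hypothetical helper: returns 0..25 for a Latin letter of either case,
   or -1 for anything else. The position comes from the string literal,
   not from the character's numeric value. */
int letter_index(char c)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p;

    if (c == '\0')          /* strchr would match the terminator */
        return -1;
    if ((p = strchr(lower, c)) != NULL)
        return (int)(p - lower);
    if ((p = strchr(upper, c)) != NULL)
        return (int)(p - upper);
    return -1;
}

/* Hypothetical helper: returns 1 if both arguments are letters and c1
   comes strictly before c2 alphabetically, ignoring case; 0 otherwise. */
int letter_precedes(char c1, char c2)
{
    int i1 = letter_index(c1);
    int i2 = letter_index(c2);
    return i1 >= 0 && i2 >= 0 && i1 < i2;
}
```

With this, letter_index('J') - letter_index('I') is 1 regardless of the encoding, even where 'I' + 1 != 'J'.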

Lee Daniel Crocker answered Nov 15 '22 20:11