Why is the behaviour of subtracting characters implementation specific?

This statement:

if('z' - 'a' == 25)

is not guaranteed to evaluate to true; the result is compiler dependent. It is also not guaranteed to be evaluated in the same way as this:

#if 'z' - 'a' == 25

even if both the preprocessor and compiler are run on the same machine. Why is that?

asked by wildpointerxx, Oct 23 '17


3 Answers

The OP is asking about a direct quote from the standard — N1570 §6.10.1p3,4 + footnote 168:

... the controlling constant expression is evaluated according to the rules of 6.6. ... This includes interpreting character constants, which may involve converting escape sequences into execution character set members. Whether the numeric value for these character constants matches the value obtained when an identical character constant occurs in an expression (other than within a #if or #elif directive) is implementation-defined.168

[footnote 168] Thus, the constant expression in the following #if directive and if statement is not guaranteed to evaluate to the same value in these two contexts.

#if 'z' - 'a' == 25
if ('z' - 'a' == 25)

So, yes, it really isn't guaranteed.
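
If you want to see what a particular implementation actually does, a small test program along these lines (my own sketch, not part of the standard's text) compares the two contexts directly:

#include <stdio.h>

int main(void)
{
#if 'z' - 'a' == 25
    int in_preprocessor = 1;   /* the #if directive saw a gap of 25 */
#else
    int in_preprocessor = 0;
#endif
    /* the same expression, now evaluated by the compiler proper */
    int in_compiler = ('z' - 'a' == 25);

    printf("#if context: %d, if context: %d\n", in_preprocessor, in_compiler);
    return 0;
}

On any ASCII-based implementation you should see 1 for both; the standard simply declines to promise that the two numbers match.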

To understand why it isn't guaranteed, first you need to know that the C standard doesn't require the character constants 'a' and 'z' to have the numeric values assigned to those characters by ASCII. Most C implementations nowadays use ASCII or a superset, but there is another encoding called EBCDIC that is still widely used (if only on IBM mainframes, of which there are still a lot out there). In EBCDIC, not only do 'a' and 'z' have different values from ASCII, the alphabet isn't even a contiguous sequence! That's why the expression 'z' - 'a' == 25 might not evaluate to true in the first place.

You also need to know that the C standard tries to maintain a distinction between the text encoding used for source code (the "source character set") and the text encoding that the program will use at runtime (the "execution character set"). This is so you can, at least in principle, take a program whose source is encoded in ASCII and run it unmodified on a computer that uses EBCDIC, just by cross-compiling appropriately; you don't have to convert the source text to EBCDIC first.

Now, the compiler has to understand both character sets if they're different, but historically, the C preprocessor (translation phases 1 through 4) and the "compiler proper" (phases 5 through 7) were two separate programs, and #if expressions are the only place where the preprocessor would have to know about the execution character set. So, by making it implementation-defined whether the "execution character set" used by the preprocessor matches that used by the compiler proper, the standard licenses the preprocessor to do all its work in the source character set, making life a little bit easier back in 1989.

Having said all that, I would be very surprised to find a modern compiler that didn't make both expressions evaluate to the same value, even when the execution and source character sets are grossly incompatible. Modern compilers tend to have integrated preprocessors -- phases 1 through 7 are all carried out by the same program -- and even if they don't, the engineering burden of specializing the preprocessor to match its execution character set to the compiler proper is trivial nowadays.
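
For what it's worth, you can poke at this with GCC, which lets you choose the execution character set with -fexec-charset (it relies on iconv, so the name has to be one your iconv installation knows). Compiling the test program above with something like

gcc -fexec-charset=IBM1047 test.c

should make 'z' - 'a' come out as 40 in both contexts, so the program prints 0 for both: GCC's integrated preprocessor uses the same execution character set as the rest of the compiler, which is exactly the point.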

answered by zwol, Nov 13 '22


Because not all computers use ASCII or Unicode.

In the past, an encoding standard called EBCDIC was common. In EBCDIC code page 500, the value of 'z' is 169 and the value of 'a' is 129, so the expression 'z' - 'a' evaluates to 40.

This explains why you cannot assume a certain value for an expression of the form 'a' or even 'z' - 'a'. However, it does not explain why the two expressions in the question are not guaranteed to be equal.

The preprocessor and the compiler are two different things. The preprocessor deals with the encoding used in the source code, while the compiler targets the machine you are compiling for. See zwol's answer for a more elaborate explanation.
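
As an aside, if you need the 0-25 index of a letter without betting on a contiguous alphabet, one portable approach (my sketch, not part of the answer) is to look the character up in a string literal instead of doing arithmetic on it:

#include <string.h>

/* Returns 0 for 'a', 25 for 'z', and -1 for anything else,
   regardless of how the execution character set orders the letters. */
int letter_index(char c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p = strchr(alphabet, c);
    return (c != '\0' && p != NULL) ? (int)(p - alphabet) : -1;
}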

answered by klutt, Nov 13 '22


To expand on the other correct answers, a real-world example of a non-ASCII C compiler that’s still being used is IBM’s z/OS XL C/C++. By default, it assumes that source files are in IBM code page 1047 (the version of EBCDIC with the same repertoire as Latin-1). However, it has several different compiler options to support not only ASCII, but also “hybrid code,” or source files containing data in more than one encoding. (These programs exist because MVS compilers required syntax statements to be in IBM-1047 encoding only.)

From the documentation, it looks like it would be possible to muck around with directives like #pragma CONVLIT(suspend) in a way that really would make those two expressions evaluate differently on that compiler. I don't have a copy to test an MCVE on, though.
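
Very roughly, and with the caveat that this is an untested sketch based only on my reading of the documentation (and it assumes CONVLIT affects character constants, not just string literals), the idea would be something like:

/* z/OS XL C, compiled with CONVLIT(ISO8859-1) in effect */
#pragma convlit(suspend)
int ebcdic_diff = 'z' - 'a';   /* constants left in the source (EBCDIC) encoding: 40 */
#pragma convlit(resume)
int ascii_diff  = 'z' - 'a';   /* constants converted to ISO8859-1: 25 */

If it works the way I expect, that is the kind of thing that could make the #if and the if from the question disagree within a single translation unit.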

answered by Davislor, Nov 13 '22