The C programming language says that the functions from <ctype.h>
follow a common requirement:
ISO C99, 7.4p1:
In all cases the argument is an
int
, the value of which shall be representable as anunsigned char
or shall equal the value of the macroEOF
. If the argument has any other value, the behavior is undefined.
This means that the following code is unsafe:
int upper(const char *s, size_t index) {
return toupper(s[index]);
}
If this code is executed on an implementation where char
has the same value space as signed char
and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:
int upper(const char *s, size_t index) {
return toupper((unsigned char) s[index]);
}
Nevertheless I see many examples in C++ that don't care about this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the above code will not lead to undefined behavior, or are all the examples wrong?
[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islowwer isprint ispunct isspace isupper isxdigit tolower]
The C++ <cctype> header file declares a set of functions to classify (and transform) individual characters.
In all cases the argument is an int , the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF . If the argument has any other value, the behavior is undefined.
h header file contains inbuilt functions to handle Strings in C/C++, the ctype. h/<cctype> contains inbuilt functions to handle characters in C/C++ respectively. Characters are of two types: Printable Characters: The characters that are displayed on the terminal.
An int is required to be at least a 16 bits signed word, and to accept all values between -32767 and 32767. That means that an int can accept all values from a char, be the latter signed or unsigned.
For what it's worth, the Solaris Studio compilers (using stlport4
) are one such compiler suite that produce an unexpected result here. Compiling and running this:
#include <stdio.h>
#include <cctype>
int main() {
char ch = '\xa1'; // '¡' in latin-1 locales + UTF-8
printf("is whitespace: %i\n", std::isspace(ch));
return 0;
}
gives me:
kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out
is whitespace: 8
For reference:
$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31
Of course, this behavior is as documented in the C++ standard, but it's definitely surprising.
EDIT: Since it was pointed out that the above version contained undefined behavior in the attempt to assign char ch = '\xa1'
due to integer overflow, here's a version that avoids that and still retains the same output:
#include <stdio.h>
#include <cctype>
int main() {
char ch = -95;
printf("is whitespace: %i\n", std::isspace(ch));
return 0;
}
And that does still print 8 on my Solaris VM:
kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out
is whitespace: 8
EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace()
:
#include <cstdio>
#include <cstring>
#include <cctype>
static int count_whitespace(const char* str, int n) {
int count = 0;
for (int i = 0; i < n; i++)
if (std::isspace(str[i])) // oops!
count += 1;
return count;
}
int main() {
const char* batman = "I am batman\xa1";
int n = std::strlen(batman);
std::printf("%i\n", count_whitespace(batman, n));
return 0;
}
And, on my Solaris machine:
kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3
Note that depending on how you permute this program, you'll probably get the expected result of two whitespace characters; that is, there is almost certainly some compiler optimization kicking in that takes advantage of this UB to give you the wrong result faster.
You could imagine this biting you in the face if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i]
to unsigned char
.
Sometimes most people are wrong. I think that's so here. Having said that there's nothing to stop an standard library implementor defining the behaviour that most people expect. So maybe that's why most people don't care, since they've never actually seen a bug resulting from this error.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With