Is it safe to call the functions from <cctype> with char arguments?

The C standard says that the functions from <ctype.h> share a common requirement:

ISO C99, 7.4p1:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
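As a rough sketch, the requirement quoted above can be written as a predicate. The name is_valid_ctype_arg is made up for illustration and is not part of any standard library; the point is only to show which int values are allowed.

```cpp
#include <climits>  // UCHAR_MAX
#include <cstdio>   // EOF

// Hypothetical helper (not a standard function): returns true if v may be
// passed to the <cctype> functions without undefined behavior, i.e. if it
// is representable as unsigned char or equals EOF.
constexpr bool is_valid_ctype_arg(int v) {
    return v == EOF || (v >= 0 && v <= UCHAR_MAX);
}
```

On an implementation where char is signed, a char holding -95 converts to the int -95, which fails this check, while static_cast<unsigned char>(-95) yields a value that passes it.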

This means that the following code is unsafe:

int upper(const char *s, size_t index) {
  return toupper(s[index]);
}

If this code is executed on an implementation where char has the same value space as signed char and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:

int upper(const char *s, size_t index) {
  return toupper((unsigned char) s[index]);
}
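In C++ the same cast can be wrapped once and reused. This is only a sketch: to_upper_safe and upper_copy are hypothetical names chosen for this example, not standard functions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: converts to unsigned char before calling
// std::toupper, so the argument is always in the range 0..UCHAR_MAX.
inline char to_upper_safe(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: returns an uppercased copy of s using the wrapper
// above, so bytes with negative char values never reach std::toupper.
inline std::string upper_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), to_upper_safe);
    return s;
}
```

Passing the wrapper to std::transform, rather than std::toupper itself, keeps the cast in one place.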

Nevertheless, I see many C++ examples that ignore this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the first, uncast version above will not lead to undefined behavior, or are all those examples wrong?

[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower]

Roland Illig asked Aug 20 '11 09:08


2 Answers

For what it's worth, the Solaris Studio compilers (using stlport4) are one compiler suite that produces an unexpected result here. Compiling and running this:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = '\xa1'; // '¡' in Latin-1; the trailing byte of '¡' (C2 A1) in UTF-8
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

gives me:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8

For reference:

$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31

Of course, the C++ standard permits this result, since passing a negative char value here is undefined behavior, but it's definitely surprising.


EDIT: It was pointed out that the initialization char ch = '\xa1' in the version above is itself questionable (where char is signed, the value of that literal is implementation-defined rather than an integer overflow with undefined behavior), so here's a version that avoids it and still produces the same output:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = -95;
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

And that does still print 8 on my Solaris VM:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8

EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace():

#include <cstdio>
#include <cstring>
#include <cctype>

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace(str[i]))  // oops!
            count += 1;
    return count;
}

int main() {
    const char* batman = "I am batman\xa1";
    int n = std::strlen(batman);
    std::printf("%i\n", count_whitespace(batman, n));
    return 0;
}

And, on my Solaris machine:

kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3

Note that depending on how you vary this program, you'll probably get the expected result of two whitespace characters; that is, some compiler optimization is almost certainly taking advantage of this UB to give you the wrong result faster.

You could imagine this biting you in the face if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i] to unsigned char.
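For comparison, here is a sketch of the counting function with the cast applied. The name count_whitespace_safe is made up for this example, and it takes a std::string rather than a pointer and length purely for brevity.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical corrected variant of count_whitespace: each char is
// converted to unsigned char before it is passed to std::isspace, so
// bytes with the high bit set (e.g. UTF-8 continuation bytes) are
// valid arguments.
inline std::size_t count_whitespace_safe(const std::string& str) {
    std::size_t count = 0;
    for (char c : str) {
        if (std::isspace(static_cast<unsigned char>(c)))
            ++count;
    }
    return count;
}
```

In the default "C" locale, count_whitespace_safe("I am batman\xa1") counts only the two spaces, because the call no longer depends on how the implementation handles negative arguments.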

Kevin Ushey answered Oct 13 '22 00:10


Sometimes most people are wrong. I think that's so here. Having said that, there's nothing to stop a standard library implementor from defining the behaviour that most people expect. So maybe that's why most people don't care: they've never actually seen a bug resulting from this error.
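To illustrate how an implementor could define the behaviour for negative values, here is a purely hypothetical sketch of a table-based classifier whose table also covers -128..-1. Everything in the sketch namespace is invented for this example, and it assumes 8-bit chars and an ASCII-compatible character set; it does not describe any particular library.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

constexpr int kOffset = 128;             // shifts -128..255 to 0..383
constexpr std::size_t kTableSize = 384;  // one entry per value in -128..255

// Whitespace in the "C" locale, assuming ASCII: space and '\t'..'\r'.
constexpr bool classic_space(int v) {
    return v == ' ' || (v >= '\t' && v <= '\r');
}

// Builds the lookup table once; negative values get entries too, so
// indexing with a sign-extended char stays inside the table.
inline const std::array<bool, kTableSize>& space_table() {
    static const std::array<bool, kTableSize> table =
        []() -> std::array<bool, kTableSize> {
            std::array<bool, kTableSize> t{};
            for (int v = -kOffset; v < 256; ++v)
                t[static_cast<std::size_t>(v + kOffset)] = classic_space(v);
            return t;
        }();
    return table;
}

// Hypothetical classifier that tolerates negative char values.
inline bool is_space_tolerant(int v) {
    if (v < -kOffset || v > 255) return false;  // outside the table
    return space_table()[static_cast<std::size_t>(v + kOffset)];
}

}  // namespace sketch
```

With a layout like this, a caller that forgets the cast gets a defined (if not necessarily meaningful) answer, which would explain why the bug often goes unnoticed. Portable code still cannot rely on it.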

john answered Oct 13 '22 00:10