Is it safe to call the functions from <cctype> with char arguments?

Tags: c++, c, language-lawyer, undefined-behavior, character

The C programming language says that the functions from <ctype.h> follow a common requirement:

ISO C99, 7.4p1:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

This means that the following code is unsafe:

int upper(const char *s, size_t index) {
  return toupper(s[index]);
}

If this code is executed on an implementation where char has the same value space as signed char and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:

int upper(const char *s, size_t index) {
  return toupper((unsigned char) s[index]);
}
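
In C++ the same cast can be written once and reused. A minimal sketch, assuming two hypothetical helpers named upper_char and upper_copy (neither is part of the standard library or of the code above):

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: uppercases one char without ever handing a
// negative value to std::toupper.
inline char upper_char(char c) {
    // Converting to unsigned char first keeps the int argument in 0..UCHAR_MAX.
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: returns an uppercased copy of s, built on upper_char.
inline std::string upper_copy(std::string s) {
    for (char& c : s)
        c = upper_char(c);
    return s;
}
```

In the default "C" locale, upper_copy("abc\xa1") yields "ABC\xa1": the byte 0xA1 passes through unchanged instead of reaching toupper as a negative int.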

Nevertheless I see many examples in C++ that don't care about this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the above code will not lead to undefined behavior, or are all the examples wrong?

[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower]

asked Aug 20 '11 09:08

Roland Illig

2 Answers

For what it's worth, the Solaris Studio compilers (using stlport4) are one compiler suite that produces an unexpected result here. Compiling and running this:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = '\xa1'; // '¡' in latin-1 locales + UTF-8
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

gives me:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8

For reference:

$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31

Of course, the standard allows this: passing a negative value other than EOF is undefined behavior, so any result is conforming. It's still definitely surprising.

EDIT: Since it was pointed out that the above version contained undefined behavior in the attempt to assign char ch = '\xa1' due to integer overflow, here's a version that avoids that and still retains the same output:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = -95;
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

And that does still print 8 on my Solaris VM:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8
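
The reason -95 is a problem can be checked directly against the range quoted from 7.4p1. A small sketch, assuming a hypothetical helper valid_ctype_arg that is not part of any library, and assuming EOF is not -95 (it is -1 on common implementations):

```cpp
#include <climits>
#include <cstdio>

// Hypothetical check: does v satisfy the precondition of the <cctype>
// functions, i.e. is it EOF or representable as unsigned char?
inline bool valid_ctype_arg(int v) {
    return v == EOF || (v >= 0 && v <= UCHAR_MAX);
}
```

On a platform where char is signed, valid_ctype_arg(ch) is false for char ch = -95, while valid_ctype_arg(static_cast<unsigned char>(ch)) is true, because the cast produces a value between 0 and UCHAR_MAX (161 with 8-bit chars).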

EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace():

#include <cstdio>
#include <cstring>
#include <cctype>

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace(str[i]))  // oops!
            count += 1;
    return count;
}

int main() {
    const char* batman = "I am batman\xa1";
    int n = std::strlen(batman);
    std::printf("%i\n", count_whitespace(batman, n));
    return 0;
}

And, on my Solaris machine:

kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3

Note that depending on how you permute this program, you'll probably get the expected result of two whitespace characters; that is, there is almost certainly some compiler optimization kicking in that takes advantage of this UB to give you the wrong result faster.

You could imagine this biting you in the face if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i] to unsigned char.
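
For comparison, here is a sketch of the counting loop with the cast applied. The name count_whitespace_safe is made up for this example, and the index type is changed to std::size_t to match std::strlen:

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Counts whitespace bytes in str[0..n). Each byte is converted to
// unsigned char before the call, so std::isspace never sees a negative int.
static std::size_t count_whitespace_safe(const char* str, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (std::isspace(static_cast<unsigned char>(str[i])))
            count += 1;
    }
    return count;
}
```

Called as count_whitespace_safe(batman, std::strlen(batman)) on the string from the program above, this returns 2 in the default "C" locale, since 0xA1 is not whitespace there.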

answered Oct 13 '22 00:10

Kevin Ushey

Sometimes most people are wrong, and I think that's so here. Having said that, there's nothing to stop a standard library implementor from defining the behaviour that most people expect. So maybe that's why most people don't care: they've never actually seen a bug resulting from this error.

answered Oct 13 '22 00:10

john
