Why do a bitwise-and of a character with 0xff?

Tags: c, char, bitwise-and, integer-promotion, isspace

I am reading some code that implements a simple parser. A function named scan breaks up a line into tokens. scan has a static variable bp that is assigned the line to be tokenized. Following the assignment, the whitespace is skipped over. See below. What I don't understand is why the code does a bitwise-and of the character that bp points to with 0xff, i.e., what is the purpose of * bp & 0xff? How is this:

while (isspace(* bp & 0xff))
    ++ bp;

different from this:

while (isspace(* bp))
    ++ bp;

Here is the scan function:

static enum tokens scan (const char * buf)
                    /* return token = next input symbol */
{   static const char * bp;

    while (isspace(* bp & 0xff))
        ++ bp;

        ..
}

asked May 24 '21 19:05

Roger Costello

3 Answers

From the C Standard (7.4 Character handling <ctype.h>)

1 The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

In this call

isspace(* bp)

the argument expression *bp, which has the type char, is converted to the type int.

If the type char behaves as the type signed char and the value of the expression *bp is negative, then the value of the converted expression of the type int is also negative and is not representable as a value of the type unsigned char.

This results in undefined behavior.

In this call

isspace(* bp & 0xff)

the bitwise operator & clears every bit above the low eight, so the result of the expression * bp & 0xff, which has the type int, lies between 0 and 255 and can be represented as a value of the type unsigned char.

So it is a trick used instead of writing clearer code like

isspace( ( unsigned char )*bp )

The function isspace is usually implemented in such a way that it uses its argument of the type int as an index into a table with 256 values (from 0 to 255). If the argument has a value greater than 255 or a negative value (and is not equal to the value of the macro EOF), then the behavior of the function is undefined.
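As an illustration of the cast form, here is a minimal sketch of a whitespace-skipping loop. The helper name skip_spaces is hypothetical and does not come from the code in the question; it only shows where the cast goes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the question's parser): returns a pointer
   to the first non-whitespace character of s. The cast keeps the value
   passed to isspace within 0..UCHAR_MAX even where plain char is signed. */
static const char *skip_spaces(const char *s)
{
    assert(s != NULL);
    while (isspace((unsigned char)*s))
        ++s;
    return s;
}
```

Calling skip_spaces("  \t x") would return a pointer to the 'x', and a byte such as 0xD6 in the input is passed to isspace as 214 rather than as a negative number.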

answered Oct 16 '22 12:10

Vlad from Moscow

From cppreference isspace(): The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.

When *bp is negative, for example -42, it is not representable as unsigned char, because an unsigned char can only hold zero or a positive value.

On two's-complement systems, values are sign-extended when converted to a wider type, so a negative value gets its left-most bits set. When you then AND the wider value with 0xff, the left-most bits are cleared and you end up with a non-negative value less than or equal to 0xff, that is, one representable as unsigned char.

Note that the operands of & undergo implicit promotions, so *bp is converted to int before isspace is even called. As an example, assume that *bp = -42 on a sane platform where char is signed and 8 bits wide and int has 32 bits. Then:

*bp & 0xff               # substitute *bp = -42
(char)-42 & 0xff         # apply promotion
(int)-42 & 0xff          # convert to hex, assuming two's complement
(int)0xffffffd6 & 0xff   # apply the & operation
(int)0xd6                # convert to decimal
214                      # representable as unsigned char, all fine

Without the & 0xff the negative value would result in undefined behavior.
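The same steps can be checked in C. This is only a sketch: the three function names are made up for illustration, signed char is used explicitly so that the result does not depend on whether plain char is signed, and the masked result assumes two's complement and 8-bit char as in the walk-through.

```c
#include <assert.h>
#include <limits.h>

/* Assumption stated in the text above: char is 8 bits wide. */
static_assert(CHAR_BIT == 8, "sketch assumes 8-bit char");

/* What isspace(*bp) would receive: the value converted to int, sign kept. */
static int promoted(signed char c) { return c; }

/* What isspace(*bp & 0xff) receives: the low 8 bits of the promoted int. */
static int masked(signed char c) { return c & 0xff; }

/* What isspace((unsigned char)*bp) receives. */
static int cast_to_uchar(signed char c) { return (unsigned char)c; }
```

For the value -42, promoted gives -42, while masked and cast_to_uchar both give 214.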

I would recommend isspace((unsigned char)*bp) instead.

Basically the simplest isspace implementation looks like just:

static const char bigarray[257] = { 0,0,0,0,0,...1,0,1,0,... };
// note: EOF is -1
#define isspace(x)  (bigarray[(x) + 1])

In that case you can't pass, for example, -42, because bigarray[-41] is an out-of-bounds access.
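A compilable version of that idea might look like the sketch below. It is not how any particular C library implements isspace; the names space_table and my_isspace are hypothetical, the table covers only the whitespace characters of the "C" locale, and the code assumes EOF is -1 and char is 8 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Assumption: EOF is -1, so that EOF + 1 indexes slot 0. */
static_assert(EOF == -1, "sketch assumes EOF == -1");

/* Hypothetical table: slot 0 is for EOF, slots 1..256 are for 0..255.
   Entries not listed are zero-initialized. */
static const unsigned char space_table[UCHAR_MAX + 2] = {
    [' '  + 1] = 1, ['\t' + 1] = 1, ['\n' + 1] = 1,
    ['\v' + 1] = 1, ['\f' + 1] = 1, ['\r' + 1] = 1,
};

static int my_isspace(int c)
{
    /* The precondition that <ctype.h> leaves to the caller: any other
       value would index outside the table. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return space_table[c + 1];
}
```

With this table, my_isspace(-42) trips the assertion, whereas my_isspace(-42 & 0xff) looks up slot 215 and returns 0.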

answered Oct 16 '22 12:10

KamilCuk

Your question:

How is this:

while (isspace(* bp & 0xff))
    ++ bp;

different from this:

while (isspace(* bp))
    ++ bp;

The difference is that in the first example you always pass a value between 0 and 255 to isspace, because of the bitwise AND with a full 8-bit mask (0b11111111, or 0xff). The parameter of isspace is wider than one byte: the function is declared as isspace(int c), and an int is several bytes on most systems. When a char is converted to that wider type, a negative value is sign-extended and its upper bits are set; the mask clears them.

In short, it's a sanity check to ensure that isspace only sees the value of a single byte from your bp variable.

answered Oct 16 '22 12:10

h0r53
