I am reading some code that implements a simple parser. A function named scan breaks up a line into tokens. scan has a static variable bp that is assigned the line to be tokenized. Following the assignment, the whitespace is skipped over. See below. What I don't understand is why the code does a bitwise AND of the character that bp points to with 0xff, i.e., what is the purpose of * bp & 0xff? How is this:
while (isspace(* bp & 0xff))
++ bp;
different from this:
while (isspace(* bp))
++ bp;
Here is the scan function:
static enum tokens scan (const char * buf)
/* return token = next input symbol */
{ static const char * bp;
while (isspace(* bp & 0xff))
++ bp;
..
}
In general, the & 0xff operation gives us a simple way to extract the lowest 8 bits of a number. We can use it to extract any 8 bits we need: first shift the value right so that the wanted bits become the lowest 8 bits, then apply & 0xff.
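The shift-then-mask idea can be sketched as follows. This is only an illustration: the helper name extract_byte and the choice of a 32-bit value are assumptions, not part of the code being asked about.

```c
#include <stdint.h>

/* Hypothetical helper: return byte number `index` of `value`,
   where index 0 is the least significant byte.
   `index` must be in the range 0..3, otherwise the shift is undefined. */
static unsigned extract_byte(uint32_t value, unsigned index)
{
    /* Move the wanted byte down to the lowest 8 bits, then mask the rest off. */
    return (unsigned)((value >> (8u * index)) & 0xffu);
}
```

For example, extract_byte(0x12345678, 0) yields 0x78 and extract_byte(0x12345678, 2) yields 0x34.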
0xff means "the hexadecimal number ff " - in other words, the integer 255 , which has the binary representation 00000000000000000000000011111111 (when using 32-bit integers). The & operator performs a bitwise AND operation.
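So, for example, the 8-bit pattern 10000010 represents decimal 130 (128+2) if it is read as unsigned, but -126 (-128+2) if it is read as signed two's complement. Likewise, the pattern 0xff (11111111) is 255 when unsigned but -1 when signed, since -128 + (64+32+16+8+4+2+1) == -128 + 127 == -1.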
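A small sketch of the two readings of the same bit pattern, assuming an 8-bit char and a two's-complement platform (converting an out-of-range value to signed char is implementation-defined before C23, so the signed results below are what common compilers produce, not a guarantee). The function names are made up for the example.

```c
/* Read an 8-bit pattern as an unsigned quantity (0..255). */
static int as_unsigned(unsigned char bits)
{
    return bits;
}

/* Read the same 8-bit pattern as a signed quantity (-128..127),
   assuming two's complement. */
static int as_signed(unsigned char bits)
{
    return (signed char)bits;
}
```

Under those assumptions, as_unsigned(0x82) is 130 while as_signed(0x82) is -126, and as_signed(0xff) is -1.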
From the C Standard (7.4 Character handling <ctype.h>)
1 The header <ctype.h> declares several functions useful for classifying and mapping characters.198) In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
In this call
isspace(* bp)
the argument expression *bp, which has type char, is converted to type int due to the integer promotions.
If type char behaves as signed char and the value of *bp is negative, then the promoted int value is also negative and cannot be represented as a value of type unsigned char.
This results in undefined behavior.
In this call
isspace(* bp & 0xff)
due to the bitwise operator &, the result of the expression * bp & 0xff, which has type int, can be represented as a value of type unsigned char.
So it is a trick used instead of writing clearer code such as
isspace( ( unsigned char )*bp )
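As a minimal sketch of how the cast could be used in a whitespace-skipping loop like the one in scan (the helper name skip_spaces is hypothetical; the original function keeps its position in the static variable bp instead):

```c
#include <ctype.h>

/* Hypothetical helper: return a pointer to the first character of `s`
   that is not whitespace (possibly the terminating '\0'). */
static const char *skip_spaces(const char *s)
{
    /* The cast keeps the argument in the range 0..UCHAR_MAX,
       even when plain char is signed and *s is negative. */
    while (isspace((unsigned char)*s))
        ++s;
    return s;
}
```

The loop stops at the terminating null character because isspace(0) is false.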
The function isspace is usually implemented in such a way that it uses its int argument as an index into a table with 256 values (from 0 to 255). If the argument is greater than the maximum value 255 or is negative (and is not equal to the value of the macro EOF), then the behavior of the function is undefined.
From cppreference isspace(): The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
When *bp is negative, for example -42, it is not representable as unsigned char, because unsigned char values must be positive or zero.
On two's-complement systems, values are sign-extended when converted to a wider type, so the left-most bits of the wider value are set. When you then AND the wider value with 0xff, the left-most bits are cleared and you end up with a non-negative value less than or equal to 0xff, that is, one representable as unsigned char.
Note that the operands of & undergo implicit promotions, so *bp is converted to int before isspace is even called. Let's assume, for example, that *bp = -42, on a sane platform where char is 8 bits and signed and int has 32 bits; then:
*bp & 0xff # expand *bp = -42
(char)-42 & 0xff # apply promotion
(int)-42 & 0xff # let's convert to hex assuming two's complement
(int)0xffffffd6 & 0xff # do & operation
(int)0xd6 # let's convert to decimal
214 # representable as unsigned char, all fine
Without the & 0xff, the negative value would result in undefined behavior.
I would recommend preferring isspace((unsigned char)*bp).
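The arithmetic for -42 can be checked with a short sketch. It assumes an 8-bit char and two's complement, as in the walk-through; the two function names are invented for the illustration.

```c
/* What the scanner does: promote to int, then keep the low 8 bits. */
static int mask_low_byte(signed char c)
{
    return c & 0xff;
}

/* The recommended alternative: convert to unsigned char first. */
static int cast_to_uchar(signed char c)
{
    return (unsigned char)c;
}
```

Under those assumptions both functions map -42 to 214, so the mask and the cast give isspace the same argument.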
Basically, the simplest isspace implementation looks something like this:
static const char bigarray[257] = { 0,0,0,0,0,...1,0,1,0,... };
// note: EOF is -1
#define isspace(x) (bigarray[(x) + 1])
and in such a case you can't pass, for example, -42, because bigarray[-41] is simply out of bounds.
Your question:
How is this:
while (isspace(* bp & 0xff))
++ bp;
different from this:
while (isspace(* bp))
++ bp;
The difference is that in the first example you always pass only the lowest byte of the value at bp to isspace, as a result of the bitwise AND with a full 8-bit mask (0b11111111 or 0xff). The argument to isspace has a type that is larger than 1 byte: isspace is declared as isspace(int c), so the argument is an int, which may be multiple bytes depending on your system.
In short, it's a sanity check to ensure that isspace only sees a single byte's worth of value (0 to 255) from your bp variable.