I want to understand the following code:
//...
#define _C 0x20
extern const char *_ctype_;
//...
__only_inline int iscntrl(int _c)
{
return (_c == -1 ? 0 : ((_ctype_ + 1)[(unsigned char)_c] & _C));
}
It originates from the file ctype.h in the OpenBSD operating system source code. This function checks whether a char is a control character or a printable letter inside the ASCII range. This is my current chain of thought:
Somehow, strangely, it works, and every time 0 is returned the given char _c is not a printable character. Otherwise, when it's printable, the function just returns an integer value that's not of any special interest. My problem of understanding is in steps 3, 4 (a bit) and 5.
Thank you for any help.
_ctype_ appears to be a restricted internal version of the symbol table, and I'm guessing the + 1 is that they didn't bother saving index 0 of it since that one isn't printable. Or possibly they are using a 1-indexed table instead of 0-indexed as is custom in C.
The C standard dictates this for all ctype.h functions:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF.
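To make the quoted requirement concrete, here is a minimal sketch of a caller that stays within it by casting each char to unsigned char before the call. The helper name count_cntrl is hypothetical and not part of any library; the expected counts assume the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the control characters in a C string.
 * The cast keeps the argument representable as unsigned char even
 * on platforms where plain char is signed. */
static size_t count_cntrl(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (iscntrl((unsigned char)*s))
            n++;
    }
    return n;
}
```

For example, count_cntrl("a\tb\n") counts the tab and the newline.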
Going through the code step by step:
int iscntrl(int _c): The int types are really characters, but all ctype.h functions are required to handle EOF, so they must be int.
-1 is a check against EOF, since it has the value -1.
_ctype_ + 1 is pointer arithmetic to get an address of an array item.
[(unsigned char)_c] is simply an array access of that array, where the cast is there to enforce the standard requirement of the parameter being representable as unsigned char. Note that char can actually hold a negative value, so this is defensive programming. The result of the [] array access is a single character from their internal symbol table.
& masking is there to get a certain group of characters from the symbol table. Apparently all characters with bit 5 set (mask 0x20) are control characters. There's no making sense of this without viewing the table.

_ctype_
is a pointer to a global array of 257 bytes. I don't know what _ctype_[0] is used for. _ctype_[1] through _ctype_[256] represent the character categories of characters 0, …, 255 respectively: _ctype_[c + 1] represents the category of the character c. This is the same thing as saying that _ctype_ + 1 points to an array of 256 characters where (_ctype_ + 1)[c] represents the category of the character c.
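The equivalence of the two spellings can be checked with a small sketch. The names offset_table and same_slot are hypothetical stand-ins; the real table lives elsewhere in libc and its contents are not shown here.

```c
/* Hypothetical 257-slot table: slot 0 is the extra entry,
 * slots 1..256 correspond to characters 0..255. */
static unsigned char offset_table[257];

/* Return nonzero if table[c + 1] and (table + 1)[c] name the
 * same byte, for a character value c in 0..255. */
static int same_slot(int c)
{
    return &offset_table[c + 1] == &(offset_table + 1)[c];
}
```

Both expressions compute the address offset_table + c + 1, so same_slot should hold for every c in range.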
(_ctype_ + 1)[(unsigned char)_c] is not a declaration. It's an expression using the array subscript operator. It's accessing position (unsigned char)_c of the array that starts at (_ctype_ + 1).
The cast of _c from int to unsigned char is not strictly necessary: ctype functions take char values cast to unsigned char (char is signed on OpenBSD), so a correct call is char c; … iscntrl((unsigned char)c). The cast has the advantage of guaranteeing that there is no buffer overflow: if the application calls iscntrl with a value that is outside the range of unsigned char and isn't -1, this function returns a value which may not be meaningful but at least won't cause a crash or a leak of private data that happened to be at the address outside of the array bounds. The value is even correct if the function is called as char c; … iscntrl(c) as long as c isn't -1.
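A rough illustration of what the cast does to out-of-range arguments, assuming an 8-bit unsigned char (true on OpenBSD and most current platforms). The function name demo_index is hypothetical.

```c
/* Hypothetical helper: the index that (unsigned char)c selects.
 * Conversion to unsigned char reduces the value modulo 256
 * (assuming 8-bit bytes), so the result is always in 0..255. */
static int demo_index(int c)
{
    return (unsigned char)c;
}
```

A negative char value such as -23 lands on index 233, which is the slot of the same byte read as unsigned char, so the lookup is still the right one. A nonsense argument such as 300 lands on index 44: meaningless, but inside the table.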
The reason for the special case with -1 is that it's EOF. Many standard C functions that operate on a char, for example getchar, represent the character as an int value which is the char value wrapped to a positive range, and use the special value EOF == -1 to indicate that no character could be read. For functions like getchar, EOF indicates the end of the file, hence the name end-of-file. Eric Postpischil suggests that the code was originally just return _ctype_[_c + 1], and that's probably right: _ctype_[0] would be the value for EOF. This simpler implementation leads to a buffer overflow if the function is misused, whereas the current implementation avoids this as discussed above.
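The two variants can be compared side by side in a sketch. Everything here is hypothetical: sketch_table is a stand-in filled with ASCII control characters only, not the real OpenBSD table, and the "simple" form is the presumed older code, which the source does not confirm.

```c
#define SKETCH_CNTRL 0x20 /* hypothetical flag, same value as _C */

/* Slot 0 stands for EOF (-1); slots 1..256 stand for characters 0..255. */
static unsigned char sketch_table[257];

/* Mark the ASCII control characters (0..31 and 127) in the stand-in table. */
static void sketch_init(void)
{
    sketch_table[0] = 0; /* EOF belongs to no category */
    for (int c = 0; c < 256; c++)
        sketch_table[c + 1] = (c < 32 || c == 127) ? SKETCH_CNTRL : 0;
}

/* Presumed older form: only valid for c in -1..255, no protection otherwise. */
static int sketch_iscntrl_simple(int c)
{
    return sketch_table[c + 1] & SKETCH_CNTRL;
}

/* Form from the question: explicit EOF test, index forced into 0..255. */
static int sketch_iscntrl_guarded(int c)
{
    return c == -1 ? 0 : ((sketch_table + 1)[(unsigned char)c] & SKETCH_CNTRL);
}
```

For every valid argument the two functions agree; they differ only for invalid arguments, where the simple form would read outside the array and the guarded form stays inside it.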
If v is the value found in the array, v & _C tests if the bit at 0x20 is set in v. The values in the array are masks of the categories that the character is in: _C is set for control characters, _U is set for uppercase letters, etc.
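As a sketch of how such category masks combine: only the value 0x20 for the control flag comes from the question. The names FLAG_UPPER, FLAG_SPACE, FLAG_CNTRL and has_flag are hypothetical, and the values 0x01 and 0x08 are assumptions made for illustration.

```c
#define FLAG_UPPER 0x01 /* assumed value for illustration */
#define FLAG_SPACE 0x08 /* assumed value for illustration */
#define FLAG_CNTRL 0x20 /* bit 5, matches _C in the question */

/* Return 1 if the table entry has the given category bit set, else 0. */
static int has_flag(unsigned char entry, unsigned char flag)
{
    return (entry & flag) != 0;
}
```

An entry can carry several categories at once, for instance FLAG_CNTRL | FLAG_SPACE for a character that is both a control character and whitespace. Note that the raw expression entry & FLAG_CNTRL evaluates to 0x20 rather than 1 when the bit is set, which is why the function in the question returns a nonzero value that is not simply 1.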
I'll start with step 3:
increment the address the undefined pointer points to by 1
The pointer is not undefined. It's just defined in some other compilation unit. That is what the extern part tells the compiler. So when all files are linked together, the linker will resolve the references to it.
So what does it point to?
It points to an array with information about each character. Each character has its own entry. An entry is a bitmap representation of characteristics for the character. For example: If bit 5 is set, it means that the character is a control character. Another example: If bit 0 is set, it means that the character is an uppercase character.
So something like (_ctype_ + 1)['x'] will get the characteristics that apply to 'x'. Then a bitwise AND is performed to check if bit 5 is set, i.e. to check whether it is a control character.
The reason for adding 1 is probably that the real index 0 is reserved for some special purpose.