Inspired by this question about the following code from SQLite3:
static int strlen30(const char *z){
const char *z2 = z;
while( *z2 ){ z2++; }
return 0x3fffffff & (int)(z2 - z);
}
that is accompanied by a commit message saying this function helps with int overflows.
I'm particularly interested in this part:
const char *z2 = z;
while( *z2 ){ z2++; }
To me, this loop advances z2 until z2 points to the null terminator. Then z2 - z yields the string length.
Why not use strlen() for this part and rewrite it like this:
return 0x3fffffff & (int)(strlen(z));
Why use loop+subtraction instead of strlen()? What can loop+subtraction do that strlen() can't?
strlen() on C-style strings can be replaced in C++ by std::string. sizeof() in C, used as an argument to functions like malloc(), memcpy() or memset(), can be replaced in C++ by new, std::copy(), and std::fill() or constructors.
If there is a bug in the library that writes this string, the string might not be zero-terminated, and strlen() could then fail.
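For illustration, a hypothetical sketch of that failure mode: the buffer is filled without a terminating '\0', so strlen() scans past the end of the array, which is undefined behaviour.
#include <stdio.h>
#include <string.h>

int main(void){
    char buf[4];
    memcpy(buf, "abcd", 4);   /* fills all four bytes, leaves no room for '\0' */
    /* strlen(buf) now scans past the end of buf until it happens to find
    ** a zero byte somewhere in memory: undefined behaviour. */
    printf("%zu\n", strlen(buf));
    return 0;
}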
The strlen() function calculates the length of a given string: it takes a string as an argument and returns its length. The returned value is of type size_t (an unsigned integer type).
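A minimal usage example of that signature:
#include <stdio.h>
#include <string.h>

int main(void){
    const char *s = "hello";
    size_t len = strlen(s);   /* counts the bytes before the terminating '\0' */
    printf("%zu\n", len);     /* prints 5 */
    return 0;
}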
I can't tell you the reason why they had to re-implement it, and why they chose int instead of size_t as the return type. But about the function:
/*
** Compute a string length that is limited to what can be stored in
** lower 30 bits of a 32-bit signed integer.
*/
static int strlen30(const char *z){
const char *z2 = z;
while( *z2 ){ z2++; }
return 0x3fffffff & (int)(z2 - z);
}
The standard (ISO/IEC 14882:2003(E)) says in 3.9.1 Fundamental types, paragraph 4:
Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer. 41)
...
41): This implies that unsigned arithmetic does not overflow because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type
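A tiny sketch of that modulo rule, using only standard <limits.h>: adding 1 to UINT_MAX wraps around to 0 instead of overflowing.
#include <limits.h>
#include <stdio.h>

int main(void){
    unsigned int u = UINT_MAX;
    u = u + 1u;          /* well-defined: reduced modulo 2^n, i.e. wraps to 0 */
    printf("%u\n", u);   /* prints 0 */
    return 0;
}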
That part of the standard does not define overflow behaviour for signed integers. If we look at 5 Expressions, paragraph 5:
If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined, unless such an expression is a constant expression (5.19), in which case the program is ill-formed. [Note: most existing implementations of C++ ignore integer overflows. Treatment of division by zero, forming a remainder using a zero divisor, and all floating point exceptions vary among machines, and is usually adjustable by a library function. ]
So much for overflow.
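By contrast, here is a sketch of the signed case, which the standard leaves undefined; a compiler is allowed to assume the overflowing statement is never reached:
#include <limits.h>
#include <stdio.h>

int main(void){
    int i = INT_MAX;
    /* i + 1 is not representable as an int: undefined behaviour.
    ** The compiler may wrap, trap, or optimise on the assumption
    ** that this statement is never reached. */
    i = i + 1;
    printf("%d\n", i);
    return 0;
}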
As for subtracting two pointers to array elements, 5.7 Additive operators, paragraph 6:
When two pointers to elements of the same array object are subtracted, the result is the difference of the subscripts of the two array elements. The type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as ptrdiff_t in the <cstddef> header (18.1). [...]
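A minimal illustration of that rule (nothing SQLite-specific, just standard C): subtracting two pointers into the same array yields the difference of the subscripts, with type ptrdiff_t.
#include <stddef.h>
#include <stdio.h>

int main(void){
    char arr[16];
    char *p = &arr[3];
    char *q = &arr[10];
    ptrdiff_t d = q - p;          /* difference of the subscripts: 10 - 3 = 7 */
    printf("%td\n", d);           /* %td is the printf conversion for ptrdiff_t */
    return 0;
}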
Looking at 18.1:
The contents are the same as the Standard C library header <stddef.h>
So let's look at the C standard (I only have a copy of C99, though), 7.17 Common definitions <stddef.h>:
- The types used for size_t and ptrdiff_t should not have an integer conversion rank greater than that of signed long int unless the implementation supports objects large enough to make this necessary.
No further guarantee is made about ptrdiff_t. Then Annex E (still in ISO/IEC 9899:TC2) gives the minimum magnitude for signed long int, but not a maximum:
#define LONG_MAX +2147483647
Now what are the maxima for int, the return type of sqlite's strlen30()? Let's skip the C++ quotation that forwards us to the C standard once again; in C99, Annex E, we see the minimum maximum for int:
#define INT_MAX +32767
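To see how these guarantees play out on a concrete platform, one can simply print the actual limits; the values vary by platform, and only the minima quoted above are guaranteed by the standard.
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

int main(void){
    /* On a typical 64-bit Linux system this prints 2147483647,
    ** 9223372036854775807 and 9223372036854775807. */
    printf("INT_MAX     = %d\n",  INT_MAX);
    printf("LONG_MAX    = %ld\n", LONG_MAX);
    printf("PTRDIFF_MAX = %td\n", (ptrdiff_t)PTRDIFF_MAX);
    return 0;
}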
Summarized:
- ptrdiff_t is not bigger than signed long, which is not smaller than 32 bits.
- int is just defined to be at least 16 bits long.
- So on some platforms ptrdiff_t can be larger than the int of your platform.
strlen30() applies a bitwise AND to the pointer-subtraction result:
          |             32 bit             |
ptr_diff  |10111101111110011110111110011111| // could be even larger
&         |00111111111111111111111111111111| // == 0x3FFFFFFF
          ----------------------------------
=         |00111101111110011110111110011111| // truncated
That prevents undefined behaviour by truncating the pointer-subtraction result to a maximum value of 0x3FFFFFFF = 1073741823 (decimal).
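A short sketch of that mask in isolation, using the bit pattern from the diagram above as a hypothetical over-large length:
#include <stdio.h>

int main(void){
    /* Hypothetical over-large length; same bit pattern as in the diagram
    ** above (bit 31 is set, so the value does not fit into 30 bits). */
    unsigned long len    = 0xBDF9EF9FUL;
    unsigned long masked = len & 0x3fffffffUL;   /* keeps only the lower 30 bits */
    printf("%#lx\n", masked);                    /* prints 0x3df9ef9f */
    printf("%lu\n",  0x3fffffffUL);              /* prints 1073741823 */
    return 0;
}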
I am not sure why they chose exactly that value, because on most machines only the most significant bit tells the signedness. It could have made sense, with respect to the standard, to choose the minimum INT_MAX (32767), but 1073741823 is indeed slightly strange without knowing more details (though it of course does exactly what the comment above the function says: truncate to 30 bits and prevent overflow).
Why not use strlen() for this part and rewrite it like this:
return 0x3fffffff & (int)(strlen(z));
My guess is that they wanted to avoid a potential indirection. Another advantage might be fewer dependencies on the standard library, which can be useful if you write a non-hosted application.
Btw, as follows from the references above, (int)(strlen(z)) might yield an out-of-range conversion if strlen(z) returns a value greater than INT_MAX (possible wherever SIZE_MAX > INT_MAX), so (int)(0x3fffffff & strlen(z)), which masks before converting, would be better.
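A sketch of the two variants side by side; the function names len_cast_first and len_mask_first are made up for illustration:
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: converts the full size_t value to int first.
** If strlen(z) > INT_MAX, the conversion itself is already out of range. */
static int len_cast_first(const char *z){
    return 0x3fffffff & (int)(strlen(z));
}

/* Hypothetical helper: masks first, so the value handed to the
** conversion always fits into an int. */
static int len_mask_first(const char *z){
    return (int)(0x3fffffff & strlen(z));
}

int main(void){
    const char *z = "hello";
    printf("%d %d\n", len_cast_first(z), len_mask_first(z));   /* prints 5 5 */
    return 0;
}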