Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does sscanf require a null terminated string as input?

A recently discovered explanation for GTA lengthy load times(1) showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with other scanning functions (scanf(), fscanf()...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to sscanf() with an offset and a %n conversion proved to be a dominant cause for the load time.

My question is should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example does the following code invoke undefined behavior:

int test(void) {
    char buf[1] = { '1' };
    int v;
    sscanf(buf, "%1d", &v);
    return v;
}

The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?


(1) references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40

like image 851
chqrlie Avatar asked Mar 01 '21 21:03

chqrlie


People also ask

Does sscanf null terminate?

No terminating null is added. Pointer to char large enough for input field. Like c , a sequence of bytes of type char (signed or unsigned), except that white space characters are not allowed, and a terminating null is always added.

Does scanf read null terminator?

Yes, it possible because '\0' is a non-white-space. scanf() will consider that is the end of the string. So %s can match a empty string. You can use the m specifier to allocate a corresponding buffer to hold the input.

Does Readinator include Null?

So yes, read() will read NUL bytes.

What does the sscanf function do?

Description. The sscanf() function reads data from buffer into the locations that are given by argument-list. Each argument must be a pointer to a variable with a type that corresponds to a type specifier in the format-string.


2 Answers

Here are the relevant parts from the C Standard:

7.21.6.7 The sscanf function Synopsis

Synopsis

#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);

Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.

Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.

The input is specifically referred to as a string, so it should be null terminated

Albeit none of the characters in the string beyond the initial prefix that matches the conversion specifier and potentially the next byte that helped determine the end of the matching sequence are used for the conversion, these characters must be followed by a null terminator so the input is a well formed string, and it is conforming to call strlen() on it to determine the input length.

To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of string to a small size with strnlen() or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach.

In the mean time, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well formed C string, but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out of range string representations.

like image 60
chqrlie Avatar answered Oct 18 '22 00:10

chqrlie


When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.

The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations that impose no such needless requirements will block any change to the Standard that would require that they add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via predefined macro or other such means.

like image 37
supercat Avatar answered Oct 18 '22 00:10

supercat