A recently discovered explanation for GTA lengthy load times(1) showed that many implementations of <code>sscanf()</code> call <code>strlen()</code> on their input string to set up a context object for an internal routine shared with other scanning functions (<code>scanf()</code>, <code>fscanf()</code>...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to <code>sscanf()</code> with an offset and a <code>%n</code> conversion proved to be a dominant cause for the load time. My question is should <code>sscanf()</code> even read the input string beyond the bytes necessary for the conversions to complete? For example does the following code invoke undefined behavior: <pre class="prettyprint"><code>int test(void) { char buf[1] = { '1' }; int v; sscanf(buf, "%1d", &v); return v; } </code></pre> The function should return <code>1</code> and does not need to read more than one byte from <code>buf</code>, but is <code>sscanf()</code> allowed to read from <code>buf</code> beyond the first byte? <hr> (1) references provided by JdeBP: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ https://news.ycombinator.com/item?id=26297612 https://github.com/biojppm/rapidyaml/issues/40

Here are the relevant parts from the C Standard: <blockquote> 7.21.6.7 The <code>sscanf</code> function Synopsis Synopsis <pre class="prettyprint"><code>#include <stdio.h> int sscanf(const char * restrict s, const char * restrict format, ...); </code></pre> Description The <code>sscanf</code> function is equivalent to <code>fscanf</code>, except that input is obtained from a string (specified by the argument <code>s</code>) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the <code>fscanf</code> function. If copying takes place between objects that overlap, the behavior is undefined. Returns The <code>sscanf</code> function returns the value of the macro <code>EOF</code> if an input failure occurs before the first conversion (if any) has completed. Otherwise, the <code>sscanf</code> function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure. </blockquote> The input is specifically referred to as a string, so it should be null terminated Albeit none of the characters in the string beyond the initial prefix that matches the conversion specifier and potentially the next byte that helped determine the end of the matching sequence are used for the conversion, these characters must be followed by a null terminator so the input is a well formed string, and it is conforming to call <code>strlen()</code> on it to determine the input length. To avoid linear time complexity on long input strings, <code>sscanf()</code> should limit the scan for the end of string to a small size with <code>strnlen()</code> or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach. In the mean time, programmers should avoid passing long input strings to <code>sscanf()</code> and use more specialized functions for their parsing tasks, such as <code>strtol()</code>, which also requires a well formed C string, but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out of range string representations.

Does sscanf require a null terminated string as input?

Tags:

c

language-lawyer

undefined-behavior

scanf

A recently discovered explanation for GTA lengthy load times⁽¹⁾ showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with other scanning functions (scanf(), fscanf()...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to sscanf() with an offset and a %n conversion proved to be a dominant cause for the load time.

My question is should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example does the following code invoke undefined behavior:

int test(void) {
    char buf[1] = { '1' };
    int v;
    sscanf(buf, "%1d", &v);
    return v;
}

The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?

⁽¹⁾ references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40

851

asked Mar 01 '21 21:03

chqrlie

2 Answers

Here are the relevant parts from the C Standard:

7.21.6.7 The sscanf function Synopsis

Synopsis
#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);
Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.

Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.

The input is specifically referred to as a string, so it should be null terminated

Albeit none of the characters in the string beyond the initial prefix that matches the conversion specifier and potentially the next byte that helped determine the end of the matching sequence are used for the conversion, these characters must be followed by a null terminator so the input is a well formed string, and it is conforming to call strlen() on it to determine the input length.

To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of string to a small size with strnlen() or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach.

In the mean time, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well formed C string, but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out of range string representations.

answered Oct 18 '22 00:10

chqrlie

When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.

The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations that impose no such needless requirements will block any change to the Standard that would require that they add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via predefined macro or other such means.

answered Oct 18 '22 00:10

supercat

Related questions
                            
                                Where can I find soft-multiply and divide algorithms?
                            
                                Is there a library repository for C? [duplicate]
                            
                                valgrind mac os mem leak
                            
                                How does GCC decide what order to output assembly functions in?
                            
                                When did constant 0 in pointer context acquire its special status?
                            
                                Python extension module with variable number of arguments
                            
                                Shared memory between python processes
                            
                                C preprocessor Literal Construction
                            
                                Partially disable pedantic warnings in gcc within source
                            
                                Address family not supported by protocol
                            
                                Atomic operations and code generation for gcc
                            
                                gdb pretty printing with python a recursive structure
                            
                                How to measure a functions stack usage in C?
                            
                                Why const variable need not to be initialized in C?
                            
                                Is it safe to completely ignore variadic arguments to a C function?
                            
                                Integration issue to tess-two (Tesseract Tools for Android)library into an Android studio and build ndk
                            
                                How to force unsigned arithmetic on fixed-width types?
                            
                                Implementation of sinpi() and cospi() using standard C math library
                            
                                How to build mcrypt library in armv7 architecture?
                            
                                Could WASM be used to check integrity of a JS method?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Does sscanf require a null terminated string as input?

Tags:

c

language-lawyer

undefined-behavior

scanf

chqrlie

People also ask

2 Answers

chqrlie

supercat

Recent Activity

Donate For Us