scanf(), field width, inf and nan

Question

Per the C standard from 1999, scanf() and strtod() should accept infinity and NaN as inputs (if supported by the implementation).

The descriptions of both functions use peculiar language, which may be open to interpretation.

scanf():

An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.

strtod():

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form.

While the latter excerpt appears to be strict in requiring the specific forms of "INF", "INFINITY", "NAN" or "NAN(n-char-sequence-opt)", the former isn't and one would think that the following code should produce infinity and NaN because the field width covers prefixes of matching input sequences:

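#include <stdio.h>

int main(void)
{
    int r;
    double d;

    /* A field width of 2 cuts "inf" and "nan" down to "in" and "na". */
    d = 0; r = sscanf("inf", "%2le", &d);
    printf("%d %e\n", r, d);
    d = 0; r = sscanf("nan", "%2le", &d);
    printf("%d %e\n", r, d);
    return 0;
}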

There's also this bit on scanf():

a,e,f,g Matches an optionally signed floating-point number, infinity, or NaN, whose format is the same as expected for the subject sequence of the strtod function. The corresponding argument shall be a pointer to floating.

Is this simply a failure to document that a field width of 2, which is shorter than the expected shortest forms ("inf" or "nan"), does not make the otherwise matching prefixes "in" and "na" valid matches?

rici · Accepted Answer

In the specification of the behaviour of scanf, an "input item" is precisely the sequence of input characters consumed by the processing of a format specifier. After processing of the format specifier, whether that processing succeeds or not, the stream is positioned precisely after the last character in the input item, as is made clear by the sentence immediately following the definition of "input item" quoted in the question.

The first character, if any, after the input item remains unread.

Once the input item has been read, scanf proceeds to the next step (paragraph 10 of the same clause), in which the input item as a whole must be convertible according to the format specifier:

10 …[the input item] is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure.

"Matching sequence" is defined in the description of each format specifier; for an f specifier that will be:

the same as expected for the subject sequence of the strtod function.

as quoted in the question.

This is not the same algorithm as used by strtod. strtod finds the longest possible matching sequence, and provided there is one (even if only one character), it converts it and places the address of the next character in the supplied endptr argument.

By contrast, scanf has to deal with the limitations of an input stream, which does not allow reliable rewinding of the read pointer by more than one character. (See the definition of ungetc.) So scanf reads until it finds a character which could not extend the match, at which point it pushes that character back into the input stream and attempts to convert what has been read to that point. Unlike strtod, it cannot backtrack to a shorter valid sequence, if there is one.

The other difference with strtod is that scanf can be restricted to a maximum length, useful for converting undelimited fixed-length input fields. With strtod it would be necessary to make a NUL-terminated copy of the fixed-length field, or temporarily insert a NUL at an appropriate point and later restore the overwritten character. In such a case, it would be important to verify that strtod consumed the entire input; if it did not, that would indicate garbage in the input.

The difference may be visible empirically. Given the input 1E-@, scanf should report a matching failure and a subsequent getchar should return '@'. strtod will return 1.0 with endptr pointing at E.

A specified field width can also cause scanf to report a matching failure. Given input 1E-7@, scanf formats %2f and %3f should fail; %1f will convert 1.0 and %4f (or larger) will convert 1E-7, leaving @ for the next specifier (or subsequent input function). %2f applied to an input of inf or nan should exhibit precisely the same behaviour as %2f applied to 1E-7: a matching failure after absorbing two characters (because the truncated field is not a valid floating point number).

Whether the above happens or not depends on whether the implementation of the standard C library conforms with the standard. Glibc does not, and glibc is most probably what you will be using on a Linux platform regardless of whether you compile with gcc or clang, because clang does not bundle a standard C library, not even in the libcxx project.

The limited testing I was able to do on Windows (using an online compiler) suggests that the libcrt implementation of scanf does work as I would expect. My own examination of the source code for the FreeBSD library suggests that its scanf will correctly report matching failures but may back the read cursor up by more than one character.

Tags:

c

floating-point

parsing

language-lawyer

scanf

Alexey Frunze
