
Inconsistent behavior of fscanf() across different compilers (consuming trailing null character)

I wrote a complete application in C99 and tested it thoroughly on two GNU/Linux-based systems. I was surprised when compiling it with Visual Studio on Windows produced a misbehaving application. At first I couldn't determine what was wrong, but stepping through it in the VC debugger revealed a discrepancy in the fscanf() function declared in stdio.h.

The following code is sufficient to demonstrate the problem:

#include <stdio.h>

int main(void) {
    unsigned num1, num2, num3;

    FILE *file = fopen("file.bin", "rb");
    if (file == NULL) { // bail out if the input file is missing
        perror("file.bin");
        return 1;
    }
    fscanf(file, "%u", &num1);
    fgetc(file); // consume and discard \0
    fscanf(file, "%u", &num2);
    fgetc(file); // ditto
    fscanf(file, "%u", &num3);
    fgetc(file); // ditto
    fclose(file);

    printf("%u, %u, %u\n", num1, num2, num3); // %u matches unsigned

    return 0;
}

Assume that file.bin contains exactly 512\0256\0128\0:

$ hexdump -C file.bin
00000000  35 31 32 00 32 35 36 00  31 32 38 00              |512.256.128.|
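
To reproduce the input, a small helper along these lines could generate the same 12 bytes. This is only a sketch: the function name write_sample_file is made up here and is not part of the original program.

```c
#include <stdio.h>

/* Writes the 12 bytes shown in the hexdump to `path`.
   Returns 0 on success, -1 on failure.
   write_sample_file is a hypothetical name chosen for this sketch. */
static int write_sample_file(const char *path)
{
    /* Spelled out byte by byte on purpose: in a string literal,
       "512\0256" would be parsed as the octal escape \025 followed by '6'. */
    static const unsigned char data[] = {
        '5', '1', '2', 0, '2', '5', '6', 0, '1', '2', '8', 0
    };
    size_t written;
    FILE *f = fopen(path, "wb");

    if (f == NULL)
        return -1;
    written = fwrite(data, 1, sizeof data, f);
    if (fclose(f) != 0)
        return -1;
    return written == sizeof data ? 0 : -1;
}
```

Calling write_sample_file("file.bin") before running the test program should yield a file matching the hexdump.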

When compiled with GCC 4.8.4 on an Ubuntu machine, the program reads the numbers as expected and prints 512, 256, 128 to stdout.
Compiling it with MinGW 4.8.1 on Windows gives the same, expected result.

However, there seems to be a major difference when I compile the code using Visual Studio Community 2015; namely, the output is:

512, 56, 28

As you can see, the trailing null characters have already been consumed by fscanf(), so fgetc() captures and discards characters that are essential to data integrity.

Commenting out the fgetc() lines makes the code work in VC, but breaks it in GCC (and possibly other compilers).

What is going on here, and how do I turn this into portable C code? Have I hit undefined behavior? Note that I'm assuming the C99 standard.

rhino asked Feb 23 '17 16:02



2 Answers

TL;DR: you've been bitten by MSVC non-conformance, a longstanding problem that MS has never shown much interest in solving. If you must support MSVC in addition to conforming C implementations, then one way to do so would be to engage conditional compilation directives to suppress the fgetc() calls when the program is compiled via MSVC.
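
A minimal sketch of that conditional-compilation idea follows. The macro SKIP_SEPARATOR and the function read_numbers are hypothetical names, and the sketch assumes that every compiler defining _MSC_VER shows the behavior described in the question, which has not been verified across MSVC versions.

```c
#include <stdio.h>

/* Hypothetical macro for this sketch: discard the separator byte only
   where fscanf() leaves it unread. ASSUMPTION: every compiler that
   defines _MSC_VER consumes the trailing NUL itself. */
#ifdef _MSC_VER
#define SKIP_SEPARATOR(fp) ((void)0)
#else
#define SKIP_SEPARATOR(fp) ((void)fgetc(fp))
#endif

/* Reads up to `count` NUL-separated unsigned decimal numbers from fp
   into out. Returns how many values were actually converted. */
static size_t read_numbers(FILE *fp, unsigned *out, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (fscanf(fp, "%u", &out[i]) != 1)
            break;              /* input or matching failure */
        SKIP_SEPARATOR(fp);
    }
    return i;
}
```

Checking the return value of fscanf() also means a short or corrupt file is reported as a smaller count instead of leaving values uninitialized.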


I'm inclined to agree with the comments that reading binary data via formatted I/O functions is a questionable plan. Even more questionable, however, is the combination of

compil[ing] it using Visual Studio on Windows

and

assuming the C99 standard.

As far as I am aware, no version of MSVC conforms to C99. Very recent versions may do a better job of conforming to C2011, in part because C2011 makes some features optional that were mandatory in C99.

Whichever version of MSVC you're using, however, I think it fails to conform to the standard (both C99 and C2011) in this area. Here is the relevant text from C99, section 7.19.6.2:

A conversion specification is executed in the following steps:

[...]

An input item is read from the stream [...]. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread.

The standard is quite clear that the first character that does not match the input sequence remains unread, so the only ways MSVC could be considered conforming are if the \0 characters could be construed as being part of (and terminating) a matching input sequence, or if fgetc() were permitted to skip \0 characters. I see no justification for the latter, especially given that the stream was opened in binary mode, so let's consider the former.

For a u conversion specifier, a matching input sequence is defined as one that

Matches an optionally signed decimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 10 for the base argument.

The "subject sequence of the strtoul function" is defined in that function's specifications:

First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string.

Note in particular that the terminating null character is explicitly attributed to the final string of unrecognized characters. It is not part of the subject sequence, and therefore should not be matched by fscanf() when it converts input according to a u specifier.
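
The strtoul() side of this argument can be observed directly through its end pointer. The helper below is a small illustrative sketch (the name first_unconverted_char is invented here): it reports the first character that strtoul() did not treat as part of the subject sequence.

```c
#include <stdlib.h>

/* Parses `text` as a base-10 unsigned long, stores the result in *value,
   and returns the first character strtoul() left unconverted.
   first_unconverted_char is a hypothetical name for this sketch. */
static char first_unconverted_char(const char *text, unsigned long *value)
{
    char *end;

    *value = strtoul(text, &end, 10);
    return *end;    /* start of the "final string" of unrecognized characters */
}
```

For the input "512", the returned character is the terminating '\0', which shows that the null character belongs to the unrecognized tail and not to the number itself.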

John Bollinger answered Sep 27 '22 18:09


The MSVC implementation of fscanf is apparently "trashing" the NUL character next to the 512:

fscanf(file, "%u", &num1);

According to the fscanf documentation, this should not take place (emphasis mine):

For every conversion specifier other than n, the longest sequence of input characters which does not exceed any specified field width and which either is exactly what the conversion specifier expects or is a prefix of a sequence it would expect, is what's consumed from the stream. The first character, if any, after this consumed sequence remains unread.

Note that this differs from the case where you want to skip trailing whitespace characters, as in the following statement:

fscanf(file, "%u ", &num1); // notice "%u "

The spec says that this skipping applies only to characters for which isspace() returns nonzero, and that is not the case here (isspace('\0') yields 0).
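
That isspace() claim is easy to check in isolation. The following sketch wraps the test in a hypothetical helper, skipped_by_whitespace_directive, and assumes the default "C" locale.

```c
#include <ctype.h>

/* Returns 1 if byte c would be skipped by a whitespace directive in a
   scanf-family format string (that is, if isspace() accepts it), else 0.
   skipped_by_whitespace_directive is a made-up name for this sketch. */
static int skipped_by_whitespace_directive(unsigned char c)
{
    return isspace(c) != 0;
}
```

In the "C" locale this returns 0 for '\0' and 1 for ' ' or '\n', so a format such as "%u " cannot legitimately swallow the null separators.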

A hacky, regex-like workaround that works in both MSVC and GCC is to replace each fgetc call with:

fscanf(file, "%*1[^0-9+-]"); // skip at most one non-%u character

or, more portably, with the implementation-defined 0-9 range replaced by literal digits:

fscanf(file, "%*1[^0123456789+-]"); // skip at most one non-%u character
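
A different technique, not taken from either answer, is to peek at the next byte with fgetc() and push it back with ungetc() unless it is the NUL separator. The sketch below assumes that behavior is acceptable for the file format; skip_one_nul is a hypothetical helper name.

```c
#include <stdio.h>

/* Consumes the next byte of fp only if it is NUL; any other byte is
   pushed back so the stream position is effectively unchanged.
   Returns 1 if a NUL was consumed, 0 otherwise (including at EOF).
   skip_one_nul is a made-up name for this sketch. */
static int skip_one_nul(FILE *fp)
{
    int c = fgetc(fp);

    if (c == '\0')
        return 1;
    if (c != EOF)
        ungetc(c, fp);  /* the standard guarantees one byte of pushback */
    return 0;
}
```

Calling skip_one_nul(file) in place of each fgetc(file) should give the same numbers whether or not fscanf() has already consumed the separator.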
Grzegorz Szpetkowski answered Sep 27 '22 17:09