I wrote a complete application in C99 and tested it thoroughly on two GNU/Linux-based systems. I was surprised when compiling it with Visual Studio on Windows resulted in the application misbehaving. At first I couldn't tell what was wrong, but after stepping through it with the VC debugger I discovered a discrepancy concerning the fscanf() function declared in stdio.h.
The following code is sufficient to demonstrate the problem:
#include <stdio.h>
int main() {
unsigned num1, num2, num3;
FILE *file = fopen("file.bin", "rb");
fscanf(file, "%u", &num1);
fgetc(file); // consume and discard \0
fscanf(file, "%u", &num2);
fgetc(file); // ditto
fscanf(file, "%u", &num3);
fgetc(file); // ditto
fclose(file);
printf("%u, %u, %u\n", num1, num2, num3); // %u matches the unsigned arguments
return 0;
}
Assume that file.bin contains exactly 512\0256\0128\0:
$ hexdump -C file.bin
00000000 35 31 32 00 32 35 36 00 31 32 38 00 |512.256.128.|
Now, when compiled with GCC 4.8.4 on an Ubuntu machine, the resulting program reads the numbers as expected and prints 512, 256, 128 to stdout.
Compiling it with MinGW 4.8.1 on Windows gives the same, expected result.
However, there seems to be a major difference when I compile the code using Visual Studio Community 2015; namely, the output is:
512, 56, 28
As you can see, the trailing null characters have already been consumed by fscanf(), so fgetc() captures and discards characters that are essential to data integrity.
Commenting out the fgetc() lines makes the code work in VC, but breaks it in GCC (and possibly other compilers).
What is going on here, and how do I turn this into portable C code? Have I hit undefined behavior? Note that I'm assuming the C99 standard.
The corresponding argument must be a pointer to the initial byte of an array of char, signed char or unsigned char large enough to accept the sequence and a terminating null character code, which will be added automatically.
The fscanf() function returns the number of fields that it successfully converted and assigned. The return value does not include fields that the fscanf() function read but did not assign. The return value is EOF if an input failure occurs before any conversion, or the number of input items assigned if successful.
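The return-value rules above can be turned into a small check. This is only a sketch: `read_unsigned` is a hypothetical helper name, not something from the question, and it assumes a single `%u` conversion per call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one unsigned decimal value from `in`.
   Returns 1 if a value was converted and assigned, 0 on a matching
   failure (the next input is not a number), and EOF on input failure. */
static int read_unsigned(FILE *in, unsigned *out)
{
    assert(in != NULL && out != NULL);

    int rc = fscanf(in, "%u", out);
    if (rc == 1)
        return 1;               /* exactly one item assigned */
    return rc == EOF ? EOF : 0; /* distinguish end of input from bad input */
}
```

Checking the result after every call makes it obvious when a stream has drifted out of sync, instead of silently printing indeterminate values.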
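A white-space character in the format string matches any amount of white space in the input, including none. This means that even a tab ( \t ) in the format string can match a single space character in the input stream. Note that fscanf() is not line-oriented: each call consumes only as much input as its format string matches.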
The fscanf() function is generally considered unsafe for string handling; it's safer to use fgets() to get a line of input and then use sscanf() to process the input.
TL;DR: you've been bitten by MSVC non-conformance, a longstanding problem that MS has never shown much interest in solving. If you must support MSVC in addition to conforming C implementations, then one way to do so would be to use conditional compilation directives to suppress the fgetc() calls when the program is compiled with MSVC.
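A minimal sketch of that conditional-compilation approach might look like the following. `read_record` is a hypothetical helper name, and the code assumes that every MSVC build shows the behavior described in the question; `_MSC_VER` is the macro that MSVC predefines.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one "<digits>\0" record from `in`.
   Returns 1 on success, 0 otherwise. */
static int read_record(FILE *in, unsigned *out)
{
    assert(in != NULL && out != NULL);

    if (fscanf(in, "%u", out) != 1)
        return 0;
#ifndef _MSC_VER
    /* Conforming libraries leave the '\0' unread, so discard it here. */
    if (fgetc(in) == EOF)
        return 0;
#endif
    /* Assumption: under MSVC, fscanf() has already consumed the '\0'. */
    return 1;
}
```

Keying on the compiler rather than on the C library is itself an assumption. The question reports that MinGW on Windows behaves like GCC, and MinGW does not define `_MSC_VER`, so it takes the fgetc() branch here.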
I'm inclined to agree with the comments that reading binary data via formatted I/O functions is a questionable plan. Even more questionable, however, is the combination of "compil[ing] it using Visual Studio on Windows" and "assuming the C99 standard."
As far as I am aware, no version of MSVC conforms to C99. Very recent versions may do a better job of conforming to C2011, in part because C2011 makes some features optional that were mandatory in C99.
Whichever version of MSVC you're using, however, I think it fails to conform to the standard (both C99 and C2011) in this area. Here is the relevant text from C99, section 7.19.6.2:
A conversion specification is executed in the following steps:
[...]
An input item is read from the stream [...]. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread.
The standard is quite clear that the first character that does not match the input sequence remains unread, so the only ways MSVC could be considered conforming are if the \0 characters could be construed as being part of (and terminating) a matching input sequence, or if fgetc() were permitted to skip \0 characters. I see no justification for the latter, especially given that the stream was opened in binary mode, so let's consider the former.
For a u conversion specifier, a matching input sequence is defined as one that:
Matches an optionally signed decimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 10 for the base argument.
The "subject sequence of the strtoul function" is defined in that function's specifications:
First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string.
Note in particular that the terminating null character is explicitly attributed to the final string of unrecognized characters. It is not part of the subject sequence, and therefore should not be matched by fscanf() when it converts input according to a u specifier.
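The same rule can be observed directly with strtoul(), whose end pointer marks where the subject sequence stops. The sketch below uses a hypothetical helper, `consumed_by_strtoul`, that reports how many characters were consumed (leading white space plus the subject sequence).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters strtoul() consumes from `s`
   in base 10, i.e. leading white space plus the subject sequence.
   The terminating null character is never counted. */
static size_t consumed_by_strtoul(const char *s)
{
    assert(s != NULL);

    char *end;
    (void)strtoul(s, &end, 10); /* only the end pointer is of interest */
    return (size_t)(end - s);
}
```

For the string "512" the helper returns 3, so the end pointer is left on the terminating null character rather than past it, which is the behavior fscanf() is required to mirror for the character that follows the input item.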
The MSVC implementation of fscanf is apparently "trashing" the NUL character next to the 512:
fscanf(file, "%u", &num1);
According to the fscanf documentation, this should not take place (emphasis mine):
For every conversion specifier other than n, the longest sequence of input characters which does not exceed any specified field width and which either is exactly what the conversion specifier expects or is a prefix of a sequence it would expect, is what's consumed from the stream. The first character, if any, after this consumed sequence remains unread.
Note that this is different from the situation where one wants to skip trailing white-space characters, as in the following statement:
fscanf(file, "%u ", &num1); // notice "%u "
The spec says that this skipping occurs only for characters that isspace classifies as white space, which is not the case here (that is, isspace('\0') yields 0).
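That claim is easy to check. The sketch below wraps isspace() in a hypothetical helper, `skipped_as_white_space`, which reports whether a white-space directive in a format string would skip a given character (in the default "C" locale).

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Hypothetical helper: returns 1 if `c` is a character that a white-space
   directive in a scanf-family format string would skip, 0 otherwise.
   `c` must be representable as unsigned char, as isspace() requires. */
static int skipped_as_white_space(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    return isspace(c) != 0;
}
```

The helper returns 0 for '\0' and 1 for ' ', '\t' and '\n', so a trailing space in the format string gives a conforming library no license to consume the null separators.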
A hacky, regex-like workaround that works in both MSVC and GCC may be to replace fgetc with:
fscanf(file, "%*1[^0-9+-]"); // skip at most one non-%u character
or, more portably, by replacing the implementation-defined 0-9 range with literal digits:
fscanf(file, "%*1[^0123456789+-]"); // skip at most one non-%u character
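A different technique from the scanset above, sketched here as an alternative, is to peek at the next byte with fgetc() and push it back with ungetc() unless it is the separator. `skip_separator` is a hypothetical helper name. The sketch assumes the separator is a single '\0' and that a library which has already consumed it leaves the stream positioned at the next number.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: discards one '\0' separator if it is the next byte.
   Returns 1 if a separator was consumed, 0 if the next byte was something
   else (it is pushed back) or the stream is at end of file. */
static int skip_separator(FILE *in)
{
    assert(in != NULL);

    int c = fgetc(in);
    if (c == '\0')
        return 1;      /* separator was still unread: drop it */
    if (c != EOF)
        ungetc(c, in); /* separator already gone: restore the byte */
    return 0;
}
```

Calling it in place of each fgetc() in the original program keeps the digits intact whether or not fscanf() has already consumed the null character, and it relies only on the one character of pushback that ungetc() guarantees.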