I'm currently building a bit of HTTP handling into a C program (compiled using glibc on Linux), which will sit behind an nginx instance, and figured I should be safe deferring argument tokenization to sscanf in this scenario.
I was very pleased to find that extracting the query out of the URI was straightforward:
char *path = "/events?a=1&b=2&c=3";
char query[64] = {0};
sscanf(path, "%*[^?]?%63s HTTP", query); // query = "a=1&b=2&c=3" (width 63 leaves room for the NUL)
but I was surprised how quickly things became interesting :(
int pos = -1;
char arg[32] = {0}, value[32] = {0};
int c = sscanf(query, "%31[^=]=%31[^&]&%n", arg, value, &pos);
For an input of a=1&b=2, I get arg="a", value="1", c=2, pos=4. Perfect: I can now rerun sscanf on query + pos to get the next argument. Why am I here?
Well, while a=1& behaves identically to the above, a=1 produces arg="a", value="1", c=2, and pos=-1. What do I make of this?
Scrambling for the documentation, I read that
n Nothing is expected; instead, the number of characters consumed
thus far from the input is stored through the next pointer,
which must be a pointer to int. This is not a conversion and
does not increase the count returned by the function. The
assignment can be suppressed with the * assignment-suppression
character, but the effect on the return value is undefined.
Therefore %*n conversions should not be used.
where more than 50% of the paragraph refers to bookkeeping minutiae. The behavior I am seeing is not discussed.
Wandering around Google search results I quickly reached for Wikipedia's entry for Scanf_format_string (which was the top hit), but, uh...
Oookay... I feel like I'm in the tumbleweeds here using a feature nobody really looks at. That doesn't inspire my remaining confidence.
Taking a look at what appears to be where %n is implemented in vfscanf-internal.c, I find that 60% of the code (lines) involves discussion regarding standards inconsistencies, 39.6% is implementation minutiae, and 0.4% is actual code (which consists in its entirety of "done++;").
It *appears* that glibc's behavior is to leave the internal value done (which I access using %n) untouched - or rather, undefined - unless some operation alters it. It also appears that using %n in this way was unforeseen and that I'm completely in "here be dragons" territory? :(
I don't think I'm going to be using scanf...
For the sake of completeness, here's something that wraps up what I'm seeing.
#include <stdio.h>

void test(const char *str) {
    int pos = -1;  // sentinel: stays -1 if %n is never reached
    char arg[32] = {0}, value[32] = {0};
    // Field widths are 31 so the terminating NUL fits in the 32-byte buffers.
    int c = sscanf(str, "%31[^=]=%31[^&]&%n", arg, value, &pos);
    printf("\"%s\": c=%d arg=\"%s\" value=\"%s\" pos=%d\n", str, c, arg, value, pos);
}

int main(void) {
    test("a=1&b=2"); // "a=1&b=2": c=2 arg="a" value="1" pos=4
    test("a=1&"); // "a=1&": c=2 arg="a" value="1" pos=4
    test("a=1"); // "a=1": c=2 arg="a" value="1" pos=-1
}
Return Value The scanf() function returns the number of fields that were successfully converted and assigned. The return value does not include fields that were read but not assigned. The return value is EOF for an attempt to read at end-of-file if no conversion was performed.
scanf returns the number of input items it successfully read and assigned. So if a call reads just one value, say an integer or a character, it returns 1 when that item is read correctly and stored in the provided variable.
I think the C standard guarantees that the value of pos in your example remains unchanged.
C17 7.21.6.2 says, describing fscanf:
(4) The fscanf function executes each directive of the format in turn. When all directives have been executed, or if a directive fails (as detailed below), the function returns. Failures are described as input failures (due to the occurrence of an encoding error or the unavailability of input characters), or matching failures (due to inappropriate input).
[...]
(6) A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive, the directive fails and the differing and subsequent characters remain unread. Similarly, if end-of-file, an encoding error, or a read error prevents a character from being read, the directive fails.
("Multibyte character" here includes ordinary single-byte characters such as your &.)
So in your "a=1" example, the directives %32[^=], =, and %32[^&] all succeed, and now the end of the string has been reached. It's explained in 7.21.6.7 that for sscanf, "reaching the end of the string is equivalent to encountering end-of-file for the fscanf function." Hence no character can be read, so the & directive fails, and sscanf returns without doing anything further. The %n directive never executed, and so nothing happened that would have the right to modify the value of pos. Therefore it must have the same value it had before, namely -1.
I don't think this case was unforeseen; just that it's already covered by existing rules, and so nobody bothered to call it out explicitly.