
Characters extracted by istream >> double

Sample code at Coliru:

#include <iostream>
#include <sstream>
#include <string>

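int main()
{
    double d;
    std::string s;

    std::istringstream iss("234cdefipxngh");
    iss >> d;     // formatted extraction of a double, performed via num_get
    iss.clear();  // reset any error flags so that the next read can proceed
    iss >> s;     // whatever the double extraction left in the stream
    std::cout << d << ", '" << s << "'\n";
}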

I'm reading off N3337 here (presumably that is the same as C++11). In [istream.formatted.arithmetic] we have (paraphrased):

operator>>(double& val);

As in the case of the inserters, these extractors depend on the locale’s num_get<> (22.4.2.1) object to perform parsing the input stream data. These extractors behave as formatted input functions (as described in 27.7.2.2.1). After a sentry object is constructed, the conversion occurs as if performed by the following code fragment:

typedef num_get< charT,istreambuf_iterator<charT,traits> > numget;
iostate err = iostate::goodbit;
use_facet< numget >(loc).get(*this, 0, *this, err, val);
setstate(err);

Looking over to 22.4.2.1:

The details of this operation occur in three stages
— Stage 1: Determine a conversion specifier
— Stage 2: Extract characters from in and determine a corresponding char value for the format expected by the conversion specification determined in stage 1.
— Stage 3: Store results

The description of Stage 2 is too long to paste in full here. However, it clearly says that all characters should be extracted before conversion is attempted, and further that exactly the following characters should be extracted:

  • any of 0123456789abcdefxABCDEFX+-
  • The locale's decimal_point()
  • The locale's thousands_sep()

Finally, the rules for Stage 3 include:

— For a floating-point value, the function strtold.

The numeric value to be stored can be one of:

— zero, if the conversion function fails to convert the entire field.

This all seems to clearly specify that the output of my code should be 0, 'ipxngh'. However, it actually outputs something else.

Is this a compiler/library bug? Is there any provision that I'm overlooking for a locale to change the behaviour of Stage 2? (In another question someone posted an example of a system that does actually extract the characters, but also extracts ipxn which are not in the list specified in N3337).

Update

As pointed out by perreal, this text from Stage 2 is relevant:

If discard is true, then if ’.’ has not yet been accumulated, then the position of the character is remembered, but the character is otherwise ignored. Otherwise, if ’.’ has already been accumulated, the character is discarded and Stage 2 terminates. If it is not discarded, then a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.

If the character is either discarded or accumulated then in is advanced by ++in and processing returns to the beginning of stage 2.

So, Stage 2 can terminate if the character is in the list of allowed characters but is not a valid next character for %g. The standard doesn't say so exactly, but presumably this refers to the definition of fscanf in C99, which allows:

  • a nonempty sequence of decimal digits optionally containing a decimal-point character, then an optional exponent part as defined in 6.4.4.2;
  • a 0x or 0X, then a nonempty sequence of hexadecimal digits optionally containing a decimal-point character, then an optional binary exponent part as defined in 6.4.4.2;
  • INF or INFINITY, ignoring case
  • NAN or NAN(n-char-sequence opt ), ignoring case in the NAN part, where:

and also

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

So, actually the Coliru output is correct; and in fact the processing must attempt to validate the sequence of characters extracted so far as a valid input to %g, while extracting each character.

Next question: is it permitted, as in the thread I linked to earlier, to accept i, n, p, etc. in Stage 2?

These are valid characters for %g; however, they are not in the list of atoms that Stage 2 is allowed to read (i.e. c == 0 for my latest quote, so the character is neither discarded nor accumulated).

asked Jul 11 '14 02:07 by M.M


1 Answer

This is a mess because it's likely that neither gcc/libstdc++'s nor clang/libc++'s implementation is conforming. It's unclear what "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1" means, but I think the phrase "next character" indicates that the check should be context-sensitive (i.e., dependent on the characters that have already been accumulated), so an attempt to parse, e.g., "21abc" should stop when 'a' is encountered. This is consistent with the discussion in LWG issue 2041, which added this sentence back to the standard after it had been deleted during the drafting of C++11. libc++'s failure to do so is bug 17782.

libstdc++, on the other hand, refuses to parse "0xABp-4" past the 0, which is actually clearly nonconforming based on the standard (it should parse "0xAB" as a hexfloat, as clearly allowed by the C99 fscanf specification for %g).

The accepting of i, p, and n is not allowed by the standard. See LWG issue 2381.

The standard describes the processing very precisely: it must be done "as if" by the specified code fragment, which does not accept those characters. Compare the resolution of LWG issue 221, in which they added x and X to the list of characters because num_get as then-described won't otherwise parse 0x for integer inputs.

As an extension, Clang/libc++ accepts "inf" and "nan" along with hexfloats, but not "infinity". See bug 19611.

answered Sep 21 '22 12:09 by T.C.