
stringstream unsigned input validation

I'm writing part of a program that parses and validates user input passed as console arguments. I chose to use stringstream for that purpose, but ran into a problem when reading unsigned types.

The following template is intended to read a value of the requested type from a given string:

#include <iostream>
#include <sstream>
#include <string>

using std::string;
using std::stringstream;
using std::cout;
using std::endl;

template<typename ValueType>
ValueType read_value(string s)
{   
    stringstream ss(s);
    ValueType res;
    ss >> res;
    if (ss.fail() or not ss.eof())
        throw string("Bad argument: ") + s;
    return res;
}
// +template specializations for strings, etc. 

int main(void)
{   
    cout << read_value<unsigned int>("-10") << endl;
}   

If the type is unsigned and the input string contains a negative number, I expect an exception to be thrown (because ss.fail() would be true). Instead, stringstream produces the value converted to the unsigned type (4294967286 in the sample above).

How can this sample be fixed to achieve the desired behavior (preferably without falling back to C functions)? I understand that it can be done with a simple check of the first character, but the input may contain leading spaces, for example. I could write my own parser, but I don't believe the problem is so unusual that the standard library cannot solve it.

The functions hidden deep inside the stringstream operators for unsigned types are strtoull and strtoul. They behave as described, but they are low-level functions. Why doesn't stringstream provide some level of validation? (I hope I'm wrong and it does, but something needs to be done to enable it.)

Alexander Sergeyev asked Sep 20 '13 12:09


2 Answers

Version disclaimer: The answer is different for C++03. The following deals with C++11.

First, let's analyse what's happening.

The statement ss >> res; calls std::istream::operator>>(unsigned int&). In [istream.formatted.arithmetic]/1, the effects are defined as follows:

These extractors behave as formatted input functions (as described in 27.7.2.2.1). After a sentry object is constructed, the conversion occurs as if performed by the following code fragment:

typedef num_get< charT,istreambuf_iterator<charT,traits> > numget;
iostate err = iostate::goodbit;
use_facet< numget >(loc).get(*this, 0, *this, err, val);
setstate(err);

In the above fragment, loc stands for the private member of the basic_ios class.

Following the reference from formatted input functions to [istream::sentry], the main effect of the sentry object here is to consume leading white-space characters. It also prevents execution of the code shown above if the stream is already in an error state (failed / eof).
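The two sentry effects mentioned here can be observed with a small sketch. The helper names extract_unsigned and extract_from_failed_stream are made up for this illustration and are not part of the original code:

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: extracts one unsigned int from `text`.
// Returns true if the extraction did not set failbit.
bool extract_unsigned(const std::string& text, unsigned int& out)
{
    std::istringstream ss(text);
    ss >> out;  // the sentry skips leading white space before conversion
    return !ss.fail();
}

// Hypothetical helper: attempts an extraction from a stream that is
// already in a failed state. The sentry does not let the conversion
// run, so `out` is expected to keep its previous value.
bool extract_from_failed_stream(unsigned int& out)
{
    std::istringstream ss("42");
    ss.setstate(std::ios_base::failbit);
    ss >> out;
    return !ss.fail();
}
```

With this, extracting from "   42" should succeed and yield 42, while the second helper should report failure and leave its argument untouched.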

The used locale is the "C" locale. Rationale:

For the stringstream constructed via stringstream ss(s);, the locale of that iostream is the current global locale at the time of construction (that's guaranteed deep down in the rabbit hole at [ios.base.locales]/4). As the global locale hasn't been changed in the OP's program, [locale.cons]/2 specifies the "classic" locale, i.e. the "C" locale.
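A quick way to check which locale a stream picked up is to ask it. This is only a sketch, and it assumes that nothing in the program has called std::locale::global beforehand; the function name stream_locale_name is made up for the example:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: returns the name of the locale that a freshly
// constructed stringstream is imbued with. If the global locale has
// not been changed, this should be the classic "C" locale.
std::string stream_locale_name()
{
    std::stringstream ss("123");
    return ss.getloc().name();
}
```

If the program had installed a different global locale before constructing the stream, the name returned here would differ accordingly.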

use_facet< numget >(loc).get uses the member function num_get<char>::get(iter_type in, iter_type end, ios_base&, ios_base::iostate& err, unsigned int& v) const; specified in [locale.num.get] (note the unsigned int, everything is still fine). The details of the string -> unsigned int conversion for the "C" locale are lengthy and described in [facet.num.get.virtuals]. Some interesting details:

  • For an unsigned integer value, the function strtoull is used.
  • If the conversion fails, ios_base::failbit is assigned to err. Specifically: "The numeric value to be stored can be one of: [...] the most negative representable value or zero for an unsigned integer type, if the field represents a value too large negative to be represented in val. ios_base::failbit is assigned to err."

We need to go to C99, 7.20.1.4 for the definition of strtoull, under paragraph 5:

If the subject sequence begins with a minus sign, the value resulting from the conversion is negated (in the return type).

and under paragraph 8:

If the correct value is outside the range of representable values, LONG_MIN, LONG_MAX, LLONG_MIN, LLONG_MAX, ULONG_MAX, or ULLONG_MAX is returned (according to the return type and sign of the value, if any), and the value of the macro ERANGE is stored in errno

It seems that it has been debated in the past whether negative values are considered valid input for strtoul. In any case, the problem lies with this function. A quick check with gcc shows that negative input is considered valid, which explains the behaviour you observed.
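The behaviour of strtoul on negative input can be checked directly. The following is a minimal sketch; the struct StrtoulResult and the function call_strtoul are names invented for this illustration, and the errno observation reflects what common implementations such as glibc do rather than a guarantee:

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical result type for this illustration.
struct StrtoulResult {
    unsigned long value;  // value returned by strtoul
    bool range_error;     // true if errno was set to ERANGE
    bool consumed_all;    // true if the whole string was parsed
};

// Hypothetical helper: calls strtoul in base 10 and records what happened.
StrtoulResult call_strtoul(const char* text)
{
    errno = 0;
    char* end = nullptr;
    unsigned long v = std::strtoul(text, &end, 10);
    return { v, errno == ERANGE, end != text && *end == '\0' };
}
```

For the input "-10", the digits are parsed as 10 and the result is negated in unsigned long, giving ULONG_MAX - 9. On the implementations I'm aware of, no range error is reported for this input.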


Historic note: C++03

C++03 used scanf inside the num_get conversion. Unfortunately, I'm not quite sure (yet) how the conversion for scanf is specified, and under which circumstances errors occur.


An explicit error check:

We can insert that check manually, either by converting into a signed value and testing for < 0, or by looking for the - character (which isn't a good idea because of possible localization issues).
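The first option (convert into a signed value, then test the sign) could look roughly like this. It is a sketch rather than a drop-in replacement: read_unsigned is a made-up name, it throws standard exception types instead of the std::string used in the question, and because it goes through long long it cannot read values above LLONG_MAX into an unsigned long long.

```cpp
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads an unsigned value by extracting into a
// wider signed type first, so that a leading minus sign is detectable.
template <typename UnsignedType>
UnsignedType read_unsigned(const std::string& s)
{
    std::istringstream ss(s);
    long long wide = 0;
    ss >> wide;

    // Same acceptance rule as in the question: the extraction must
    // succeed and must consume the whole string.
    if (ss.fail() || !ss.eof())
        throw std::invalid_argument("Bad argument: " + s);

    if (wide < 0)
        throw std::invalid_argument("Negative value: " + s);

    // Reject values that do not fit into the requested unsigned type.
    const unsigned long long max_value =
        static_cast<unsigned long long>(std::numeric_limits<UnsignedType>::max());
    if (static_cast<unsigned long long>(wide) > max_value)
        throw std::out_of_range("Value too large: " + s);

    return static_cast<UnsignedType>(wide);
}
```

A call such as read_unsigned<unsigned int>("-10") should then throw, while leading white space as in "  42" is still accepted because the stream skips it.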

dyp answered Oct 17 '22 00:10


Here is a num_get facet that supports the explicit check for signedness. It rejects any non-zero number beginning with a '-' (after white space) for unsigned types and uses the default C locale's num_get to do the actual conversion.

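#include <locale>
#include <istream>
#include <ios>
#include <iterator>     // std::istreambuf_iterator
#include <cstddef>      // std::size_t
#include <type_traits>  // std::is_unsigned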

template <class charT, class InputIterator = std::istreambuf_iterator<charT> >
class num_get_strictsignedness : public std::num_get <charT, InputIterator>
{
public:
    typedef charT char_type;
    typedef InputIterator iter_type;

    explicit num_get_strictsignedness(std::size_t refs = 0)
        : std::num_get<charT, InputIterator>(refs)
    {}
    ~num_get_strictsignedness()
    {}

private:
    #define DEFINE_DO_GET(TYPE) \
        virtual iter_type do_get(iter_type in, iter_type end,      \
            std::ios_base& str, std::ios_base::iostate& err,       \
            TYPE& val) const override                              \
        {  return do_get_templ(in, end, str, err, val);  }         // MACRO END

    DEFINE_DO_GET(unsigned short)
    DEFINE_DO_GET(unsigned int)
    DEFINE_DO_GET(unsigned long)
    DEFINE_DO_GET(unsigned long long)

    // no static locale::id is declared here: the facet inherits the id of
    // std::num_get, so it replaces the standard num_get in the locale

    template <class T>
    iter_type do_get_templ(iter_type in, iter_type end, std::ios_base& str,
                           std::ios_base::iostate& err, T& val) const
    {
        using namespace std;

        if(in == end)
        {
            err |= ios_base::eofbit;
            return in;
        }

        // leading white spaces have already been discarded by the
        // formatted input function (via sentry's constructor)

        // (assuming that) the sign, if present, has to be the first character
        // for the formatting required by the locale used for conversion

        // use the "C" locale; could use any locale, e.g. as a data member

        // note: the signedness check isn't actually required
        //       (because we only overload the unsigned versions)
        bool do_check = false;
        if(std::is_unsigned<T>{} && *in == '-')
        {
            ++in;  // not required
            do_check = true;
        }

        in = use_facet< num_get<charT, InputIterator> >(locale::classic())
                 .get(in, end, str, err, val);

        if(do_check && 0 != val)
        {
            err |= ios_base::failbit;
            val = 0;
        }

        return in;
    }
};

Usage example:

#include <sstream>
#include <iostream>
int main()
{
    std::locale loc( std::locale::classic(),
                     new num_get_strictsignedness<char>() );
    std::stringstream ss("-10");
    ss.imbue(loc);
    unsigned int ui = 42;
    ss >> ui;
    std::cout << "ui = "<<ui << std::endl;
    if(ss)
    {
        std::cout << "extraction succeeded" << std::endl;
    }else
    {
        std::cout << "extraction failed" << std::endl;
    }
}

Notes:

  • the allocation on the free store is not required; you could use, e.g., a (static) local variable whose ref counter is initialized to 1 in the constructor
  • for every character type you want to support (such as char, wchar_t, charXY_t), you need to add a separate facet (these can be different instantiations of the num_get_strictsignedness template)
  • "-0" is accepted
7 revs answered Oct 17 '22 00:10