
Efficiently reading two comma-separated floats in brackets from a string without being affected by the global locale

Tags: c++, std, parsing

I am a developer of a library, and our old code uses sscanf() and sprintf() to read/write a variety of internal types from/to strings. We have had issues with users whose locale differed from the one our XML files are based on (the "C" locale). In our case this resulted in incorrect values being parsed from those XML files and from strings submitted at run time. The locale may be changed by the user directly, but it can also change without the user's knowledge. This can happen if the locale change occurs inside another library, such as GTK, which was the "perpetrator" in one bug report. Therefore, we obviously want to remove any dependency on the locale to permanently free ourselves from these issues.

I have already read other questions and answers about parsing float/double/int/... values, especially when they are separated by a character or enclosed in brackets, but so far none of the proposed solutions I found satisfied us. Our requirements are:

  1. No dependencies on libraries other than the standard library. Using anything from boost is therefore, for example, not an option.

  2. Must be thread-safe. This applies specifically to the locale, which can be changed globally. This is really awful for us, because a thread of our library can be affected by another thread in the user's program, which may be running code of a completely different library. Anything affected by setlocale() directly is therefore not an option. Also, setting the locale before reading/writing and restoring the original value afterwards is not a solution, due to race conditions between threads.

  3. While efficiency is not the topmost priority (#1 & #2 are), it still matters to us, as strings may be read and written quite frequently at run time, depending on the user's program. The faster, the better.

Edit: As an additional note: boost::lexical_cast is not guaranteed to be unaffected by the locale (source: Locale invariant guarantee of boost::lexical_cast<>). So that would not be a solution even without requirement #1.

I gathered the following information so far:

  • First of all, what I saw suggested a lot is using boost's lexical_cast, but unfortunately this is not an option for us at all, as we can't require all users to also link to boost (and because of the lacking locale-safety, see above). I looked at the code to see if we could extract anything from it, but I found it difficult to understand and too long, and most likely the big performance-gainers use locale-dependent functions anyway.
  • Many functions introduced in C++11, such as std::to_string, std::stod, std::stof, etc., depend on the global locale just the way sscanf and sprintf do, which is extremely unfortunate and, to me, hard to understand, considering that std::thread was added at the same time.
  • std::stringstream seems to be a solution in general, since it is thread-safe with respect to the locale, and also in general if guarded properly. However, if it is constructed afresh every time it can be slow (good comparison: http://www.boost.org/doc/libs/1_55_0/doc/html/boost_lexical_cast/performance.html). I assume this can be solved by keeping one configured stream available per thread and clearing it after each use. However, a problem is that it doesn't handle formats as easily as sscanf() does, for example: " { %g , %g } ".

sscanf() patterns that we, for example, need to be able to read are:

  • " { %g , %g }"
  • " { { %g , %g } , { %g , %g } }"
  • " { top: { %g , %g } , left: { %g , %g } , bottom: { %g , %g } , right: { %g , %g }"

Writing these with stringstreams seems to be no big deal, but reading them seems problematic, especially considering the whitespace.
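For the writing direction, a minimal sketch might look like the following. The name formatPair is made up for illustration, and imbuing the stream with std::locale::classic() is an assumption about how the stream is kept independent of the global C++ locale, since a freshly constructed stream copies whatever locale is global at that moment.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: writes two floats in the "{ %g , %g }" layout.
inline std::string formatPair(float x, float y) {
    std::ostringstream out;
    // Use the classic "C" locale so the decimal separator is always '.'.
    out.imbue(std::locale::classic());
    // Default float formatting behaves like printf's %g with precision 6.
    out << "{ " << x << " , " << y << " }";
    return out.str();
}
```

Calling formatPair(1.5f, 2.25f) should yield "{ 1.5 , 2.25 }" regardless of the global locale.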

Should we use std::regex in this context or is this overkill? Are stringstreams a good solution for this task or is there any better way to do this given the mentioned requirements? Also, are there any other problems in the context of thread-safety and locales that I have not considered in my question - especially regarding the usage of std::stringstream?

Ident asked Aug 13 '15 00:08


2 Answers

In your case the stringstream seems to be the best approach, as you can control its locale independently of the global locale that was set. But it's true that formatted reading is not as easy as with sscanf().
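To illustrate controlling the stream's locale, here is a small sketch. Note that a stream copies the global C++ locale when it is constructed, so it has to be imbued explicitly to be independent of it. The names CommaDecimal and parseClassic are hypothetical; the facet only exists to imitate a locale that uses ',' as the decimal separator without relying on any locale being installed on the system.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet imitating a locale whose decimal separator is ','.
struct CommaDecimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Hypothetical helper: parses a double with the classic "C" locale,
// whatever the global locale happens to be.
inline bool parseClassic(const std::string& text, double& out) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());  // affects this stream only
    in >> out;
    return !in.fail();
}
```

If the global locale is replaced by std::locale(std::locale::classic(), new CommaDecimal), a default-constructed stream no longer reads "1.5" as 1.5, while parseClassic still does.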

From the point of view of performance, stream input with regex is overkill for this kind of simple comma-separated reading: in an informal benchmark it was more than 10 times slower than scanf().

You can easily write a small auxiliary class to facilitate reading formats like the ones you have listed. The general idea is described in another SO answer. Usage can be as simple as:

sst >> mandatory_input(" { ")>> x >> mandatory_input(" , ")>>y>> mandatory_input(" } ");

If you're interested, I wrote one some time ago. The full article has examples and explanation as well as source code. The class is 70 lines of code, but most of them provide error-processing functions in case these are needed. It has acceptable performance, but is still slower than scanf().

Christophe answered Oct 06 '22 23:10


Based on the suggestions by Christophe and some other Stack Overflow answers I found, I created a set of two functions and one class to achieve all the stream parsing functionality we required. The two functions are sufficient to parse the formats proposed in the question.

The first function strips leading whitespace and then skips an optional character:

#include <ios>
#include <istream>

template<char matchingCharacter>
std::istream& optionalChar(std::istream& inputStream)
{
    // Do nothing if an earlier extraction has already failed
    if (inputStream.fail())
        return inputStream;

    inputStream >> std::ws;
    if (inputStream.peek() == matchingCharacter)
    {
        inputStream.ignore();
    }
    else
    {
        // If peek is executed but no further characters remain,
        // the failbit will be set, we want to undo this
        inputStream.clear(inputStream.rdstate() & ~std::ios::failbit);
    }
    return inputStream;
}

The second function strips leading whitespace and then checks for a mandatory character. If it doesn't match, the fail bit will be set:

template<char matchingCharacter>
std::istream& mandatoryChar(std::istream& inputStream)
{
    if (inputStream.fail())
        return inputStream;

    inputStream >> std::ws;
    if (inputStream.peek() == matchingCharacter)
        inputStream.ignore();
    else
        inputStream.setstate(std::ios_base::failbit);

    return inputStream;
}

It makes sense to use a long-lived stringstream (one per thread, as hinted at in my question; call strStream.str(std::string()) and clear() before each use) to increase performance. With the optional character checks I could make the parsing more lenient towards other styles. Here is an example usage:
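One way to keep such a reusable stream without sharing it between threads is a thread_local instance. The following is only a sketch under that assumption; ClassicStream and threadStream are names invented for this example, and imbuing with std::locale::classic() is added here to pin the stream to the "C" locale.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical holder that imbues its stream once, on construction.
struct ClassicStream {
    std::stringstream stream;
    ClassicStream() { stream.imbue(std::locale::classic()); }
};

// Hypothetical accessor: returns this thread's stream, loaded with `text`.
inline std::stringstream& threadStream(const std::string& text) {
    thread_local ClassicStream holder;  // constructed once per thread
    holder.stream.str(text);            // replace the buffer contents
    holder.stream.clear();              // reset eofbit/failbit from the last use
    return holder.stream;
}
```

Formatting flags such as std::hex are not reset by str() or clear(), so code that changes them on the shared stream has to restore them itself.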

// Format is: " { { %g , %g } , { %g , %g } } " but we are lenient regarding the format,
// so this is also allowed: " { %g %g } { %g %g } "
std::stringstream sstream(inputString);
sstream.clear();
// The first brace is mandatory and the second optional, so both the nested
// and the flat form are accepted
sstream >> mandatoryChar<'{'> >> optionalChar<'{'> >> val1 >>
    optionalChar<','> >> val2 >>
    mandatoryChar<'}'> >> optionalChar<','> >> mandatoryChar<'{'> >> val3 >>
    optionalChar<','> >> val4;
if (sstream.fail())
    logError(inputString);
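For the simplest pattern from the question, " { %g , %g }", the two helpers can be wrapped as in the following sketch. The helpers are repeated in condensed form so that the example compiles on its own; parsePair is a hypothetical name, and treating the comma as optional is a choice made for this example.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Condensed copies of the two helpers shown above.
template<char matchingCharacter>
std::istream& optionalChar(std::istream& in) {
    if (in.fail()) return in;
    in >> std::ws;
    if (in.peek() == matchingCharacter)
        in.ignore();
    else
        in.clear(in.rdstate() & ~std::ios::failbit);
    return in;
}

template<char matchingCharacter>
std::istream& mandatoryChar(std::istream& in) {
    if (in.fail()) return in;
    in >> std::ws;
    if (in.peek() == matchingCharacter)
        in.ignore();
    else
        in.setstate(std::ios_base::failbit);
    return in;
}

// Hypothetical wrapper for " { %g , %g }"; returns false on any mismatch.
inline bool parsePair(const std::string& text, float& x, float& y) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    in >> mandatoryChar<'{'> >> x >> optionalChar<','> >> y >> mandatoryChar<'}'>;
    return !in.fail();
}
```

Both " { 1.5 , 2.5 }" and "{3 4}" are accepted, while input with a missing brace is rejected.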

Addition - Checking for mandatory strings:

Last but not least I created a class for checking for mandatory strings in streams from scratch, based on the idea by Christophe. Header-file:

class MandatoryString
{
public:
    MandatoryString(char const* mandatoryString);

    friend std::istream& operator>> (std::istream& inputStream, const MandatoryString& mandatoryString);  

private:
    char const* m_chars;
};

Cpp file:

MandatoryString::MandatoryString(char const* mandatoryString)
    : m_chars(mandatoryString)
{}

std::istream& operator>> (std::istream& inputStream, const MandatoryString& mandatoryString) 
{
    if (inputStream.fail())
        return inputStream;

    char const* currentMandatoryChar = mandatoryString.m_chars;

    while (*currentMandatoryChar != '\0')
    {
        static const std::locale spaceLocale("C");

        if (std::isspace(*currentMandatoryChar, spaceLocale))
        {
            inputStream >> std::ws;
        }
        else
        {
            int peekedChar = inputStream.get();
            if (peekedChar != *currentMandatoryChar)
            {
                inputStream.setstate(std::ios::failbit); 
                break;
            }
        }
        ++currentMandatoryChar;
    }
    return inputStream;
}

The MandatoryString class is used similarly to the above functions, e.g.:

sstream >> MandatoryString(" left");
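As a slightly larger illustration, one labelled pair such as " top: { %g , %g }" could be read as sketched below. The class is a condensed variant of the MandatoryString above (same idea, shorter loop) so that the example is self-contained; parseTop is a hypothetical name.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Condensed variant of MandatoryString: a space in the pattern matches any
// run of whitespace (including none), every other character must match exactly.
class MandatoryString {
public:
    explicit MandatoryString(char const* chars) : m_chars(chars) {}

    friend std::istream& operator>>(std::istream& in, const MandatoryString& expected) {
        for (char const* c = expected.m_chars; *c != '\0' && !in.fail(); ++c) {
            if (std::isspace(*c, std::locale::classic()))
                in >> std::ws;
            else if (in.get() != *c)
                in.setstate(std::ios::failbit);  // literal mismatch or end of input
        }
        return in;
    }

private:
    char const* m_chars;
};

// Hypothetical wrapper for the " top: { %g , %g }" part of the third pattern.
inline bool parseTop(const std::string& text, float& x, float& y) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    in >> MandatoryString(" top: {") >> x
       >> MandatoryString(" ,") >> y
       >> MandatoryString(" }");
    return !in.fail();
}
```

Keeping each mandatory string on the same line as the variable that follows it makes longer patterns easy to scan.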

Conclusion: While this solution might be more verbose than sscanf, it gives us all the flexibility we needed while being able to use stringstreams, which makes this solution generally thread-safe and independent of the global locale. It is also easy to check for errors, and once a fail bit is set, parsing is halted inside the suggested functions. For very long sequences of values to parse from a string, this can actually become more readable than sscanf: for example, it allows splitting the parsing across multiple lines, with each mandatory string on the same line as the variables it precedes. After overloading the stream operators << and >> for our internal types, everything looks very clean and is easily maintainable. Parsing multiple hexadecimals also works fine; we just reset the previously set std::hex to std::dec after the operation is done.
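The hexadecimal case can be sketched as follows. The function name parseHexHexDec and the input layout (two hexadecimal values followed by one decimal value) are assumptions made purely for illustration.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads two hexadecimal integers and then one decimal
// integer from the same stream.
inline bool parseHexHexDec(const std::string& text,
                           unsigned& a, unsigned& b, unsigned& count) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    in >> std::hex >> a >> b;   // std::hex stays active until changed
    in >> std::dec >> count;    // switch back before reading base-10 values
    return !in.fail();
}
```

Without the std::dec, the third value would also be interpreted as hexadecimal, because the base flag persists on the stream.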

Ident answered Oct 06 '22 23:10