How to parse a sequence of integers stored in a text buffer?

Tags:

c++

Parsing text consisting of a sequence of integers from a stream in C++ is easy enough: just decode them. The situation is a bit different when the data has already been received and is readily available within the program, e.g., as the result of decoding base64 encoded text (the decoding isn't the problem). The data is sitting in a buffer within the program and only needs to be decoded, not read. Of course, a std::istringstream could be used:

#include <iterator>
#include <sstream>
#include <string>
#include <vector>

std::vector<int> parse_text(char* begin, char* end) {
    std::istringstream in(std::string(begin, end));
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}

Since a lot of these buffers are received and they can be fairly big, it is desirable not to copy the actual content of the character array and, ideally, also to avoid creating a stream for each buffer. Thus, the question becomes:

Given a buffer of chars containing a sequence of integers (space separated; dealing with other separators is easily done, e.g., using a suitable manipulator), how can they be decoded without copying the sequence and, if possible, without even creating a std::istream?

Dietmar Kühl asked Apr 12 '14 22:04


1 Answer

Avoiding a copy of the buffer is easily done with a custom stream buffer which simply sets up the get area to use the buffer. The stream buffer doesn't even need to override any of the virtual functions and just sets up the internal buffer:

#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

class imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    imemstream(char* begin, char* end)
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
        this->setg(begin, begin, end); 
    }
};

std::vector<int> parse_data_via_istream(char* begin, char* end)
{
    imemstream in(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}

This approach avoids copying the buffer and uses the ready-made std::istream functionality. However, it does create a stream object. With a suitable update function, the stream/stream buffer can be extended to reset the get area and process multiple buffers with a single stream object.
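The update function mentioned above could look roughly like the following. This is only a sketch: the class name reusable_imemstream, the member reset(), and the helper parse_with() are made up for illustration and are not part of the original code.

```cpp
#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

// Hypothetical variant of imemstream which can be pointed at a new buffer.
class reusable_imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    reusable_imemstream()
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
    }

    // Set the get area to the new buffer and clear the eofbit/failbit
    // left behind after the previous buffer was exhausted.
    void reset(char* begin, char* end)
    {
        this->setg(begin, begin, end);
        this->clear();
    }
};

// Hypothetical helper: reuse one stream object for many buffers.
std::vector<int> parse_with(reusable_imemstream& in, char* begin, char* end)
{
    in.reset(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
```

One reusable_imemstream can then be created once (or once per thread) and passed to parse_with() for each received buffer. The call to clear() matters: without it the stream would still be in a failed state from reaching the end of the previous buffer.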

To avoid creating a stream, the underlying functionality from std::num_get<...> could be used. The actual parsing is done by one of the std::locale facets: the numeric parsing for std::istream is done by std::num_get<char, std::istreambuf_iterator<char>>. This facet isn't much help because it reads a sequence specified by std::istreambuf_iterator<char>s, but a std::num_get<char, char const*> facet can be instantiated. It won't be part of the default std::locale but it is easy to create a corresponding std::locale and install it, e.g., as the global std::locale object first thing in main():

int main()
{
    std::locale::global(std::locale(std::locale(),
                                    new std::num_get<char, char const*>()));
    ...

Note that the std::locale object will clean up the added facet, i.e., there is no need to add any clean-up code: the facets are reference counted and released when the last std::locale holding a particular facet disappears. To actually use the facet, it unfortunately needs a std::ios_base object, which can only really be obtained from some stream object. However, any stream can be used (although in a multi-threaded system it should probably be a separate stream object per thread to avoid accidental race conditions):

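#include <algorithm>
#include <cctype>
#include <locale>
#include <vector>

char const* skipspace(char const* it, char const* end)
{
    return std::find_if(it, end,
                        [](unsigned char c){ return !std::isspace(c); });
}

std::vector<int> parse_data_via_num_get(std::ios_base& fmt,
                                        char const* it, char const* end)
{
    std::vector<int> rc;
    std::num_get<char, char const*> const& ng
        = std::use_facet<std::num_get<char, char const*>>(std::locale());

    for (it = skipspace(it, end); it != end; it = skipspace(it, end)) {
        // start from goodbit: get() is not required to reset the state
        std::ios_base::iostate error = std::ios_base::goodbit;
        long tmp = 0;
        it = ng.get(it, end, fmt, error, tmp);
        if (error & std::ios_base::failbit) {
            break; // not a number: stop parsing
        }
        // eofbit alone only means the value ended at the end of the buffer
        rc.push_back(static_cast<int>(tmp));
    }

    return rc;
}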

Most of this is just about a bit of error handling and skipping leading whitespace: std::istream normally provides facilities to automatically skip whitespace for formatted input and deals with the necessary error protocol. There is potentially a small advantage to the approach outlined above: the facet is obtained just once per buffer, no std::istream::sentry object is created, and no stream needs to be created per buffer. Of course, the code assumes that some stream can be passed in as its std::ios_base& subobject to provide parsing flags like the base to be used.
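For illustration, here is a self-contained sketch of the facet-based approach which does not rely on the global std::locale. It makes a few assumptions that are not in the code above: the names buffer_num_get, skip_space, and parse_buffer are made up, the facet is kept in a function-local std::locale, and a std::ios constructed with a null stream buffer stands in for "some stream" to supply the formatting flags.

```cpp
#include <algorithm>
#include <cctype>
#include <ios>
#include <locale>
#include <vector>

using buffer_num_get = std::num_get<char, char const*>;

char const* skip_space(char const* it, char const* end)
{
    return std::find_if(it, end,
                        [](unsigned char c) { return !std::isspace(c); });
}

// Hypothetical helper: parse whitespace separated integers from [it, end).
std::vector<int> parse_buffer(char const* it, char const* end)
{
    // The locale owns the facet and releases it; no clean-up code needed.
    static std::locale const loc(std::locale::classic(), new buffer_num_get());
    buffer_num_get const& ng = std::use_facet<buffer_num_get>(loc);

    // Assumption: a std::ios without a stream buffer is enough here, as it
    // only supplies the flags (decimal by default) and its locale.
    std::ios fmt(nullptr);

    std::vector<int> rc;
    for (it = skip_space(it, end); it != end; it = skip_space(it, end)) {
        std::ios_base::iostate error = std::ios_base::goodbit;
        long tmp = 0;
        it = ng.get(it, end, fmt, error, tmp);
        if (error & std::ios_base::failbit) {
            break; // not a number: stop parsing
        }
        rc.push_back(static_cast<int>(tmp));
    }
    return rc;
}
```

Only failbit is treated as an error. A value which ends exactly at the end of the buffer is reported with eofbit set, so testing for a completely clean state would drop the last value of a buffer without trailing whitespace.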

OK, this is quite a bit of code for something which strtol() could mostly do, too. The approach using std::num_get<char, char const*> has some flexibility which isn't offered by strtol():

  1. Since the std::locale's facets are used, which can be overridden to parse arbitrary representations, e.g., Roman numerals, it is more flexible with respect to input formats.
  2. It is easy to set up the use of thousands separators or to change the representation of the decimal point (just change the std::numpunct<char> facet in the std::locale used by fmt).
  3. The buffer doesn't have to be null-terminated. For example, a contiguous sequence of characters made up of 8-digit values can be parsed by passing it and it+8 as the range when calling std::num_get<char, char const*>::get().
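The third point can be sketched as follows. The function parse_fixed_width() and its width parameter are hypothetical, and the sketch again assumes that a std::ios with a null stream buffer is acceptable as the source of formatting flags.

```cpp
#include <cstddef>
#include <ios>
#include <locale>
#include <vector>

// Hypothetical helper: split a run of digits into fields of `width`
// characters and parse each field. No separators or null terminator needed.
std::vector<long> parse_fixed_width(char const* it, char const* end,
                                    std::ptrdiff_t width)
{
    using facet_type = std::num_get<char, char const*>;
    std::locale const loc(std::locale::classic(), new facet_type());
    facet_type const& ng = std::use_facet<facet_type>(loc);
    std::ios fmt(nullptr); // only supplies flags and locale

    std::vector<long> rc;
    while (width > 0 && end - it >= width) {
        std::ios_base::iostate error = std::ios_base::goodbit;
        long tmp = 0;
        // The range [it, it + width) limits how far get() may read.
        char const* stop = ng.get(it, it + width, fmt, error, tmp);
        if ((error & std::ios_base::failbit) || stop != it + width) {
            break; // the field is not made up entirely of a number
        }
        rc.push_back(tmp);
        it = stop;
    }
    return rc;
}
```

With strtol() the same input would need to be copied into null-terminated pieces first, because strtol() keeps consuming digits until it finds a character which is not part of a number.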

However, strtol() is probably a good approach for most uses. On the other hand, the above provides an alternative which may be useful in some contexts.
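For comparison, a minimal strtol() based version might look like this. It is a sketch which assumes the buffer is null-terminated and that values are decimal; the function name parse_with_strtol is made up.

```cpp
#include <cerrno>
#include <cstdlib>
#include <vector>

// Assumes `text` points to a null-terminated sequence of characters.
std::vector<int> parse_with_strtol(char const* text)
{
    std::vector<int> rc;
    for (;;) {
        char* stop = nullptr;
        errno = 0;
        // strtol() skips leading whitespace on its own.
        long const value = std::strtol(text, &stop, 10);
        if (stop == text || errno == ERANGE) {
            break; // nothing was consumed, or the value doesn't fit a long
        }
        rc.push_back(static_cast<int>(value));
        text = stop;
    }
    return rc;
}
```

This is shorter than the facet-based code but it gives up the three points listed above, most notably the ability to parse a range which isn't null-terminated.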

Dietmar Kühl answered Oct 23 '22 10:10