Parsing text consisting of a sequence of integers from a stream in C++ is easy enough: just decode them. When the data is received somehow and is readily available within a program, e.g., receiving a base64 encoded text (the decoding isn't the problem), the situation is a bit different. The data is sitting in a buffer within the program and only needs to be decoded, not read. Of course, a std::istringstream could be used:
std::vector<int> parse_text(char* begin, char* end) {
    std::istringstream in(std::string(begin, end));
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
Since a lot of these buffers are received and they can be fairly big, it is desirable not to copy the actual content of the character array and, ideally, to also avoid creating a stream for each buffer. Thus, the question becomes:
Given a buffer of chars containing a sequence of (space separated; dealing with other separators is easily done, e.g., using a suitable manipulator) integers, how can they be decoded without copying the sequence and, if possible, without creating even a std::istream?
Avoiding a copy of the buffer is easily done with a custom stream buffer which simply sets up the get area to use the buffer. The stream buffer doesn't even need to override any of the virtual functions and would just set up the internal buffer:
class imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    imemstream(char* begin, char* end)
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
        this->setg(begin, begin, end);
    }
};

std::vector<int> parse_data_via_istream(char* begin, char* end)
{
    imemstream in(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
This approach avoids copying the buffer and uses the ready-made std::istream functionality. However, it does create a stream object. With a suitable update function the stream/stream buffer can be extended to reset the buffer and process multiple buffers.
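One way such an update function might look is sketched below. The class name resettable_imemstream, the member reset(), and the helper parse_with() are hypothetical names chosen for illustration; they are not part of the code above. The sketch assumes the stream state has to be cleared between buffers, because reading the previous buffer to its end leaves eofbit and failbit set.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

// Hypothetical variant of imemstream which can be pointed at a new buffer.
class resettable_imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    resettable_imemstream()
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this))
    {
    }

    // Point the get area at [begin, end) and drop the error flags left
    // over from exhausting the previous buffer.
    void reset(char* begin, char* end)
    {
        assert(begin <= end);
        this->setg(begin, begin, end);
        this->clear();
    }
};

// Hypothetical helper: decode one buffer with an existing stream object.
std::vector<int> parse_with(resettable_imemstream& in, char* begin, char* end)
{
    in.reset(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
```

A single resettable_imemstream can then be created once and passed to parse_with() for each incoming buffer, so no stream is constructed per buffer.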
To avoid creation of the stream, the underlying functionality from std::num_get<...> could be used. The actual parsing is done by one of the std::locale facets. The numeric parsing for std::istream is done by std::num_get<char, std::istreambuf_iterator<char>>. This facet isn't much help as it uses a sequence specified by std::istreambuf_iterator<char>s, but a std::num_get<char, char const*> facet can be instantiated. It won't be part of the default std::locale, but it is easy to create a corresponding std::locale and install it, e.g., as the global std::locale object first thing in main():
int main()
{
    std::locale::global(std::locale(std::locale(),
                                    new std::num_get<char, char const*>()));
    ...
Note that the std::locale object will clean up the added facet, i.e., there is no need to add any clean-up code: the facets are reference counted and released when the last std::locale holding a particular facet disappears. To actually use the facet, it unfortunately needs a std::ios_base object, which can only really be obtained from some stream object. However, any stream can be used (although in a multi-threaded system it should probably be a separate stream object per thread to avoid accidental race conditions):
char const* skipspace(char const* it, char const* end)
{
    return std::find_if(it, end,
                        [](unsigned char c){ return !std::isspace(c); });
}

std::vector<int> parse_data_via_istream(std::ios_base& fmt,
                                        char const* it, char const* end)
{
    std::vector<int> rc;
    std::num_get<char, char const*> const& ng
        = std::use_facet<std::num_get<char, char const*>>(std::locale());
    // get() only ever adds flags to error, so it has to start out as goodbit
    std::ios_base::iostate error = std::ios_base::goodbit;
    // a number ending exactly at end sets eofbit but is still a valid value;
    // only failbit indicates that nothing could be parsed
    for (long tmp;
         (it = ng.get(skipspace(it, end), end, fmt, error, tmp))
         , !(error & std::ios_base::failbit); ) {
        rc.push_back(tmp);
    }
    return rc;
}
Most of this is just about a bit of error handling and skipping leading whitespace: std::istream normally provides facilities to automatically skip whitespace for formatted input and deals with the necessary error protocol. There is potentially a small advantage of the approach outlined above with respect to getting the facet just once per buffer and avoiding creation of a std::istream::sentry object as well as avoiding creation of a stream. Of course, the code assumes that some stream can be passed in as its std::ios_base& subobject to provide parsing flags like the base to be used.
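To illustrate how the std::ios_base& argument carries the parsing flags, here is a minimal sketch which parses hexadecimal values. The names buffer_num_get, parse_longs() and parse_hex() are made up for this example. It assumes that a std::istream constructed with a null stream buffer is acceptable as a pure carrier of flags and locale, and it installs the facet into that stream's locale instead of the global one.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <ios>
#include <istream>
#include <locale>
#include <vector>

typedef std::num_get<char, char const*> buffer_num_get;

// Hypothetical helper: parse whitespace separated integers from [it, end)
// using the base selected by the format flags of fmt.
std::vector<long> parse_longs(std::ios_base& fmt, char const* it, char const* end)
{
    assert(it <= end);
    // Look the facet up in the locale of fmt rather than in the global locale.
    buffer_num_get const& ng = std::use_facet<buffer_num_get>(fmt.getloc());
    std::vector<long> rc;
    for (;;) {
        it = std::find_if(it, end,
                          [](unsigned char c){ return !std::isspace(c); });
        std::ios_base::iostate error = std::ios_base::goodbit;
        long tmp = 0;
        it = ng.get(it, end, fmt, error, tmp);
        if (error & std::ios_base::failbit) {
            break; // nothing parsable left (or the end was reached)
        }
        rc.push_back(tmp);
    }
    return rc;
}

std::vector<long> parse_hex(char const* begin, char const* end)
{
    // A stream without a stream buffer: used only for its flags and locale.
    std::istream fmt(nullptr);
    fmt.imbue(std::locale(fmt.getloc(), new buffer_num_get()));
    fmt.setf(std::ios_base::hex, std::ios_base::basefield);
    return parse_longs(fmt, begin, end);
}
```

Calling parse_hex() on the characters of "ff 10 7" would be expected to yield 255, 16 and 7; switching the basefield flag back to std::ios_base::dec makes the same loop parse decimal values.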
OK, this is quite a bit of code for something which strtol() could mostly do, too. The approach using std::num_get<char, char const*> has some flexibility which isn't offered by strtol():
- Since std::locale's facets are used, which can be overridden to parse arbitrary formats of representation, e.g., Roman numerals, it is more flexible with respect to input formats.
- The punctuation used in numbers, e.g., thousands separators, can be customized (just change the std::numpunct<char> in the std::locale used by fmt to set these up).
- The input doesn't need to be null-terminated: for example, a field of 8 characters inside a larger buffer can be parsed by passing it and it+8 as the range when calling std::num_get<char, char const*>::get().
However, strtol() is probably a good approach for most uses. On the other hand, the above provides an alternative which may be useful in some contexts.
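The last two points can be sketched as follows. All names here (buffer_num_get, comma_grouping, parse_range(), parse_two_fields(), parse_grouped()) are hypothetical and only serve the illustration; the grouping rule of three digits separated by ',' is likewise just an assumed example format.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <locale>
#include <string>

typedef std::num_get<char, char const*> buffer_num_get;

// Hypothetical numpunct: ',' separates groups of three digits.
struct comma_grouping : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Parse exactly the characters in [begin, end) as one value. Returns
// false if parsing failed or did not consume the whole range.
bool parse_range(std::ios_base& fmt, char const* begin, char const* end,
                 long& value)
{
    assert(begin <= end);
    buffer_num_get const& ng = std::use_facet<buffer_num_get>(fmt.getloc());
    std::ios_base::iostate error = std::ios_base::goodbit;
    char const* stop = ng.get(begin, end, fmt, error, value);
    return !(error & std::ios_base::failbit) && stop == end;
}

// Two adjacent 8 digit fields without separator or terminator in between.
bool parse_two_fields(char const* it, long& first, long& second)
{
    std::istream fmt(nullptr); // only carries flags and locale
    fmt.imbue(std::locale(fmt.getloc(), new buffer_num_get()));
    return parse_range(fmt, it, it + 8, first)
        && parse_range(fmt, it + 8, it + 16, second);
}

// A value written with thousands separators, e.g., 1,234,567.
bool parse_grouped(char const* begin, char const* end, long& value)
{
    std::istream fmt(nullptr);
    std::locale loc(std::locale(fmt.getloc(), new buffer_num_get()),
                    new comma_grouping());
    fmt.imbue(loc);
    return parse_range(fmt, begin, end, value);
}
```

With strtol() the first case would require copying each field into a null-terminated temporary, and the second case would stop at the first ','.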