hi i would like to understand why the following code which does a split string split using regex
#include<regex>
#include<vector>
#include<string>
std::vector<std::string> split(const std::string &s){
static const std::regex rsplit(" +");
auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
auto rend = std::sregex_token_iterator();
auto res = std::vector<std::string>(rit, rend);
return res;
}
int main(){
for(auto i=0; i< 10000; ++i)
split("a b c", " ");
return 0;
}
is slower then the following python code
import re
for i in range(10000):
re.split(' +', 'a b c')
here's
> python test.py 0.05s user 0.01s system 94% cpu 0.070 total
./test 0.26s user 0.00s system 99% cpu 0.296 total
Im using clang++ on osx.
compiling with -O3 brings it to down to 0.09s user 0.00s system 99% cpu 0.109 total
See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.
I've augmented the loop to 1000000 to get a better timing measure.
This is my Python timing:
real 0m2.038s
user 0m2.009s
sys 0m0.024s
Here's an equivalent of your code, just a bit prettier:
#include <regex>
#include <vector>
#include <string>
std::vector<std::string> split(const std::string &s, const std::regex &r)
{
return {
std::sregex_token_iterator(s.begin(), s.end(), r, -1),
std::sregex_token_iterator()
};
}
int main()
{
const std::regex r(" +");
for(auto i=0; i < 1000000; ++i)
split("a b c", r);
return 0;
}
Timing:
real 0m5.786s
user 0m5.779s
sys 0m0.005s
This is an optimization to avoid construction/allocation of vector and string objects:
#include <regex>
#include <vector>
#include <string>
void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
auto rend = std::sregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
Timing:
real 0m3.034s
user 0m3.029s
sys 0m0.004s
This is near a 100% performance improvement.
The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there's no memory deallocation by clear()
, the vector maintains the memory and construct strings in-place.
Another performance increase would be to avoid construction/destruction std::string
completely, and hence, allocation/deallocation of its objects.
This is a tentative in this direction:
#include <regex>
#include <vector>
#include <string>
void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
Timing:
real 0m2.509s
user 0m2.503s
sys 0m0.004s
An ultimate improvement would be to have a std::vector
of const char *
as return, where each char pointer would point to a substring inside the original s
c string itself. The problem is that, you can't do that because each of them would not be null terminated (for this, see usage of C++1y string_ref
in a later sample).
This last improvement could also be achieved with this:
#include <regex>
#include <vector>
#include <string>
void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v); // the constant string("a b c") should be optimized
// by the compiler. I got the same performance as
// if it was an object outside the loop
return 0;
}
I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.
This is the boost::regex
timing for the c string arguments sample:
real 0m1.284s
user 0m1.278s
sys 0m0.005s
Same code, boost::regex
and std::regex
interface in this sample are identical, just needed to change the namespace and include.
Best wishes for it to get better over time, C++ stdlib regex implementations are in their infancy.
For sake of completion, I've tried this (the above mentioned "ultimate improvement" suggestion) and it didn't improved performance of the equivalent std::vector<std::string> &v
version in anything:
#include <regex>
#include <vector>
#include <string>
template<typename Iterator> class intrusive_substring
{
private:
Iterator begin_, end_;
public:
intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}
Iterator begin() {return begin_;}
Iterator end() {return end_;}
};
using intrusive_char_substring = intrusive_substring<const char *>;
void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear(); // This can potentially be optimized away by the compiler because
// the intrusive_char_substring destructor does nothing, so
// resetting the internal size is the only thing to be done.
// Formerly allocated memory is maintained.
while(rit != rend)
{
v.emplace_back(rit->first, rit->second);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<intrusive_char_substring> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
This has to do with the array_ref and string_ref proposal. Here's a sample code using it:
#include <regex>
#include <vector>
#include <string>
#include <string_ref>
void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.emplace_back(rit->first, rit->length());
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string_ref> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
It will also be cheaper to return a vector of string_ref
rather than string
copies for the case of split
with vector return.
This new solution is able to get output by return. I have used Marshall Clow's string_view
(string_ref
got renamed) libc++ implementation found at https://github.com/mclow/string_view.
#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>
using namespace std;
using namespace std::experimental;
using namespace boost;
string_view stringfier(const cregex_token_iterator::value_type &match) {
return {match.first, static_cast<size_t>(match.length())};
}
using string_view_iterator =
transform_iterator<decltype(&stringfier), cregex_token_iterator>;
iterator_range<string_view_iterator> split(string_view s, const regex &r) {
return {
string_view_iterator(
cregex_token_iterator(s.begin(), s.end(), r, -1),
stringfier
),
string_view_iterator()
};
}
int main() {
const regex r(" +");
for (size_t i = 0; i < 1000000; ++i) {
split("a b c", r);
}
}
Timing:
real 0m0.385s
user 0m0.385s
sys 0m0.000s
Note how faster this is compared to previous results. Of course, it's not filling a vector
inside the loop (nor matching anything in advance probably too), but you get a range anyway, which you can range over with range-based for
, or even use it to fill a vector
.
As ranging over the iterator_range
creates string_view
s over an original string
(or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.
Just to compare using this split
implementation but actually filling a vector
we could do this:
int main() {
const regex r(" +");
vector<string_view> v;
v.reserve(10);
for (size_t i = 0; i < 1000000; ++i) {
copy(split("a b c", r), back_inserter(v));
v.clear();
}
}
This uses boost range copy algorithm to fill the vector in each iteration, the timing is:
real 0m1.002s
user 0m0.997s
sys 0m0.004s
As can be seen, no much difference in comparison with the optimized string_view
output param version.
Note also there's a proposal for a std::split
that would work like this.
For optimizations, in general, you want to avoid two things:
The two can be antithetic as sometimes it ends up being faster computing something than caching all of it in memory... so it's a game of balance.
Let's analyze your code:
std::vector<std::string> split(const std::string &s){
static const std::regex rsplit(" +"); // only computed once
// search for first occurrence of rsplit
auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
auto rend = std::sregex_token_iterator();
// simultaneously:
// - parses "s" from the second to the past the last occurrence
// - allocates one `std::string` for each match... at least! (there may be a copy)
// - allocates space in the `std::vector`, possibly multiple times
auto res = std::vector<std::string>(rit, rend);
return res;
}
Can we do better ? Well, if we could reuse existing storage instead of keeping allocating and deallocating memory, we should see a significant improvement [1]:
// Overwrites 'result' with the matches, returns the number of matches
// (note: 'result' is never shrunk, but may be grown as necessary)
size_t split(std::string const& s, std::vector<std::string>& result){
static const std::regex rsplit(" +"); // only computed once
auto rit = std::cregex_token_iterator(s.begin(), s.end(), rsplit, -1);
auto rend = std::cregex_token_iterator();
size_t pos = 0;
// As long as possible, reuse the existing strings (in place)
for (size_t max = result.size();
rit != rend && pos != max;
++rit, ++pos)
{
result[pos].assign(rit->first, rit->second);
}
// When more matches than existing strings, extend capacity
for (; rit != rend; ++rit, ++pos) {
result.emplace_back(rit->first, rit->second);
}
return pos;
} // split
In the test that you perform, where the number of submatches is constant across iterations, this version is unlikely to be beaten: it will only allocate memory on the first run (both for rsplit
and result
) and then keep reusing existing memory.
[1]: Disclaimer, I have only proved this code to be correct I have not tested it (as Donald Knuth would say).
How about this verion? It is no regex, but it solves the split pretty fast ...
#include <vector>
#include <string>
#include <algorithm>
size_t split2(const std::string& s, std::vector<std::string>& result)
{
size_t count = 0;
result.clear();
std::string::const_iterator p1 = s.cbegin();
std::string::const_iterator p2 = p1;
bool run = true;
do
{
p2 = std::find(p1, s.cend(), ' ');
result.push_back(std::string(p1, p2));
++count;
if (p2 != s.cend())
{
p1 = std::find_if(p2, s.cend(), [](char c) -> bool
{
return c != ' ';
});
}
else run = false;
} while (run);
return count;
}
int main()
{
std::vector<std::string> v;
std::string s = "a b c";
for (auto i = 0; i < 100000; ++i)
split2(s, v);
return 0;
}
$ time splittest.exe
real 0m0.132s user 0m0.000s sys 0m0.109s
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With