I investigated some time here on StackOverflow to find good algorithms to split strings with multiple delimiters into a vector< string >
. I also found some methods:
The Boost way:
boost::split(vector, string, boost::is_any_of(" \t"));
the getline
method:
std::stringstream ss(string); std::string item; while(std::getline(ss, item, ' ')) { vector.push_back(item); }
the tokenize way of Boost:
char_separator<char> sep(" \t"); tokenizer<char_separator<char>> tokens(string, sep); BOOST_FOREACH(string t, tokens) { vector.push_back(t); }
and the cool STL way:
istringstream iss(string); copy(istream_iterator<string>(iss), istream_iterator<string>(), back_inserter<vector<string> >(vector));
and the method of Shadow2531 (see the linked topic).
Most of them came from this topic. But they unfortunately don't solve my problem:
Boost's split is easy to use but with the big data (about 1.5*10^6 single elements in best cases) and about 10 delimiters I am using it's horrific slow.
The getline
, STL and Shadow2531's method have the problem that I can only use one single char as delimiter. I need a few more.
Boost's tokenize is even more horrific in the aspect of speed. It took 11 seconds with 10 delimiters to split a string into 1.5*10^6 elements.
So I don't know what to do: I want to have a really fast string splitting algorithm with multiple delimiters.
Is Boost's split the maximum or is there a way to do it faster?
Use the String. split() method to split a string with multiple separators, e.g. str. split(/[-_]+/) . The split method can be passed a regular expression containing multiple characters to split the string with multiple separators.
Using String. split() Method. The split() method of the String class is used to split a string into an array of String objects based on the specified delimiter that matches the regular expression.
split() you can specify multiple patterns for the separator. As shown in the solution, the separator is either ahyphen(-), or whitespace( ), or comma(,) followed values.
Two things come to mind:
find
ing into the delimiter characters.Here's a quick attempt at applying these ideas:
#include <vector> #include <bitset> #include <iostream> #include <boost/algorithm/string/split.hpp> #include <boost/algorithm/string/classification.hpp> #include <boost/timer.hpp> using namespace std; size_t const N = 10000000; template<typename C> void test_custom(string const& s, char const* d, C& ret) { C output; bitset<255> delims; while( *d ) { unsigned char code = *d++; delims[code] = true; } typedef string::const_iterator iter; iter beg; bool in_token = false; for( string::const_iterator it = s.begin(), end = s.end(); it != end; ++it ) { if( delims[*it] ) { if( in_token ) { output.push_back(typename C::value_type(beg, it)); in_token = false; } } else if( !in_token ) { beg = it; in_token = true; } } if( in_token ) output.push_back(typename C::value_type(beg, s.end())); output.swap(ret); } template<typename C> void test_strpbrk(string const& s, char const* delims, C& ret) { C output; char const* p = s.c_str(); char const* q = strpbrk(p+1, delims); for( ; q != NULL; q = strpbrk(p, delims) ) { output.push_back(typename C::value_type(p, q)); p = q + 1; } output.swap(ret); } template<typename C> void test_boost(string const& s, char const* delims) { C output; boost::split(output, s, boost::is_any_of(delims)); } int main() { // Generate random text string text(N, ' '); for( size_t i = 0; i != N; ++i ) text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'\t'); char const* delims = " \t[],-'/\\!\"§$%&=()<>?"; // Output strings boost::timer timer; test_boost<vector<string> >(text, delims); cout << "Time: " << timer.elapsed() << endl; // Output string views typedef string::const_iterator iter; typedef boost::iterator_range<iter> string_view; timer.restart(); test_boost<vector<string_view> >(text, delims); cout << "Time: " << timer.elapsed() << endl; // Custom split timer.restart(); vector<string> vs; test_custom(text, delims, vs); cout << "Time: " << timer.elapsed() << endl; // Custom split timer.restart(); vector<string_view> vsv; test_custom(text, delims, vsv); cout << "Time: " << timer.elapsed() << endl; // Custom split timer.restart(); vector<string> vsp; test_strpbrk(text, delims, vsp); cout << "Time: " << timer.elapsed() << endl; // Custom split timer.restart(); vector<string_view> vsvp; test_strpbrk(text, delims, vsvp); cout << "Time: " << timer.elapsed() << endl; return 0; }
Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4
flag enabled I get:
NOTE: There's a slight difference in the output as empty tokens are dropped by my custom function. But you can adapt this code to your needs if you decide to use it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With