Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast string splitting with multiple delimiters

I investigated some time here on StackOverflow to find good algorithms to split strings with multiple delimiters into a vector< string >. I also found some methods:

The Boost way:

boost::split(vector, string, boost::is_any_of(" \t")); 

the getline method:

std::stringstream ss(string); std::string item; while(std::getline(ss, item, ' ')) {     vector.push_back(item); } 

the tokenize way of Boost:

char_separator<char> sep(" \t"); tokenizer<char_separator<char>> tokens(string, sep); BOOST_FOREACH(string t, tokens) {    vector.push_back(t); } 

and the cool STL way:

     istringstream iss(string);      copy(istream_iterator<string>(iss),      istream_iterator<string>(),      back_inserter<vector<string> >(vector)); 

and the method of Shadow2531 (see the linked topic).

Most of them came from this topic. But they unfortunately don't solve my problem:

  • Boost's split is easy to use but with the big data (about 1.5*10^6 single elements in best cases) and about 10 delimiters I am using it's horrific slow.

  • The getline, STL and Shadow2531's method have the problem that I can only use one single char as delimiter. I need a few more.

  • Boost's tokenize is even more horrific in the aspect of speed. It took 11 seconds with 10 delimiters to split a string into 1.5*10^6 elements.

So I don't know what to do: I want to have a really fast string splitting algorithm with multiple delimiters.

Is Boost's split the maximum or is there a way to do it faster?

like image 800
Paul Avatar asked Mar 31 '11 20:03

Paul


People also ask

How do you split with multiple separators?

Use the String. split() method to split a string with multiple separators, e.g. str. split(/[-_]+/) . The split method can be passed a regular expression containing multiple characters to split the string with multiple separators.

How do you split a string with two delimiters?

Using String. split() Method. The split() method of the String class is used to split a string into an array of String objects based on the specified delimiter that matches the regular expression.

Can we pass multiple separators in split () python?

split() you can specify multiple patterns for the separator. As shown in the solution, the separator is either ahyphen(-), or whitespace( ), or comma(,) followed values.


1 Answers

Two things come to mind:

  1. Use string views instead of strings as the split result, saves a lot of allocations.
  2. If you know you're only going to be working with chars (in the [0,255] range), try using a bitset to test membership instead of finding into the delimiter characters.

Here's a quick attempt at applying these ideas:

#include <vector> #include <bitset> #include <iostream> #include <boost/algorithm/string/split.hpp> #include <boost/algorithm/string/classification.hpp> #include <boost/timer.hpp>  using namespace std; size_t const N = 10000000;  template<typename C> void test_custom(string const& s, char const* d, C& ret) {   C output;    bitset<255> delims;   while( *d )   {     unsigned char code = *d++;     delims[code] = true;   }   typedef string::const_iterator iter;   iter beg;   bool in_token = false;   for( string::const_iterator it = s.begin(), end = s.end();     it != end; ++it )   {     if( delims[*it] )     {       if( in_token )       {         output.push_back(typename C::value_type(beg, it));         in_token = false;       }     }     else if( !in_token )     {       beg = it;       in_token = true;     }   }   if( in_token )     output.push_back(typename C::value_type(beg, s.end()));   output.swap(ret); }  template<typename C> void test_strpbrk(string const& s, char const* delims, C& ret) {   C output;    char const* p = s.c_str();   char const* q = strpbrk(p+1, delims);   for( ; q != NULL; q = strpbrk(p, delims) )   {     output.push_back(typename C::value_type(p, q));     p = q + 1;   }    output.swap(ret); }  template<typename C> void test_boost(string const& s, char const* delims) {   C output;   boost::split(output, s, boost::is_any_of(delims)); }  int main() {   // Generate random text   string text(N, ' ');   for( size_t i = 0; i != N; ++i )     text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'\t');    char const* delims = " \t[],-'/\\!\"§$%&=()<>?";    // Output strings   boost::timer timer;   test_boost<vector<string> >(text, delims);   cout << "Time: " << timer.elapsed() << endl;    // Output string views   typedef string::const_iterator iter;   typedef boost::iterator_range<iter> string_view;   timer.restart();   test_boost<vector<string_view> >(text, delims);   cout << "Time: " << timer.elapsed() << endl;    // Custom split   timer.restart();   vector<string> vs;   test_custom(text, delims, vs);   cout << "Time: " << timer.elapsed() << endl;    // Custom split   timer.restart();   vector<string_view> vsv;   test_custom(text, delims, vsv);   cout << "Time: " << timer.elapsed() << endl;    // Custom split   timer.restart();   vector<string> vsp;   test_strpbrk(text, delims, vsp);   cout << "Time: " << timer.elapsed() << endl;    // Custom split   timer.restart();   vector<string_view> vsvp;   test_strpbrk(text, delims, vsvp);   cout << "Time: " << timer.elapsed() << endl;    return 0; } 

Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4 flag enabled I get:

  • Time: 5.951 (Boost.Split + vector)
  • Time: 3.728 (Boost.Split + vector
  • Time: 1.662 (Custom split + vector)
  • Time: 0.144 (Custom split + vector)
  • Time: 2.13 (Strpbrk + vector)
  • Time: 0.527 (Strpbrk + vector)

NOTE: There's a slight difference in the output as empty tokens are dropped by my custom function. But you can adapt this code to your needs if you decide to use it.

like image 169
Pablo Avatar answered Sep 22 '22 09:09

Pablo