Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest Way to Determine if Character Belongs to a Set of Known Characters C++

Given any character, what's the fastest way for me to determine if that character belongs to a set (not the container-type) of known characters.

In other words, what's the fastest elegant way to implement the conditional:
char c = 'a';
if(c == ch1 || c == ch2 || c == ch3 ...) // Do something...

Is there an STL container (I'm thinking it might be unordered_set?) that I can just pass the character as a key to and it'll return true if the key exists?

Anything with an O(1) lookup time will work for me.

like image 357
Alexander Eden Avatar asked Mar 16 '15 00:03

Alexander Eden


3 Answers

I went a little further and wrote two versions, one based on a lookup array, the other on a set using an underlying hash.

class CharLookup {
public:
  CharLookup(const std::string & set) : lookup(*std::max_element(set.begin(), set.end()) + 1) {
    for ( auto c : set) lookup[c] = true;
  }
  inline bool has(const unsigned char c) const {
    return c > lookup.size() ? false : lookup[c];
  }
private:
  std::vector<bool> lookup;
};

class CharSet {
public:
  CharSet(const std::string & cset) {
    for ( auto c : cset) set.insert(c);
  }
  inline bool has(const unsigned char c) const {
    return set.contains(c);
  }
private:
  QSet<unsigned char> set;
};

Then wrote a little benchmark, added a few more containers for the sake of comparison. Lower is better, the data points are for "character set size / text size":

enter image description here

Seems like for short char sets and text, std::string::find_first_of is fastest, even faster than using a lookup array, but quickly dwindles as the test size increases. std::vector<bool> seems like the "golden mean", QBitArray probably has a little different implementation because it pulls ahead as the test size increases, at the largest test QVector<bool> is fastest, presumably because it doesn't have the overhead of bit access. The two hash sets are close, trading places, last and least there is the std::set.

Tested on an i7-3770k Win7 x64 box, using MinGW 4.9.1 x32 with -O3.

like image 136
dtech Avatar answered Sep 25 '22 23:09

dtech


You could create an array of booleans and assign the value true for each character in wanted set. For example if your wanted set consists of 'a', 'd', 'e':

bool array[256] = {false};
array['a'] = true;
array['d'] = true;
array['e'] = true;

and then you can check a character c:

if (array[c]) ... 

We could also use a bitset for this purpose:

std::bitset<256> b;
b.set('a');
b.set('d');
b.set('e');

and checking as:

if (b.test(c)) ...
like image 25
JuniorCompressor Avatar answered Sep 22 '22 23:09

JuniorCompressor


Keep it simple.

static const char my_chars[] = { 'a', 'b', 'c' };
if (std::find(std::begin(my_chars), std::end(my_chars), char_to_test))

The lookup is linear, but that doesn't tell the whole story. Other data structures may give constant lookup, but the constant can be higher than the maximum "linear" value! If search times with increasing n are O(1) = (100, 100, 100) and O(n) = (10, 20, 30), then you can see that O(n) is faster than O(1) for these small n.

As there are only a small number of characters, I would be very surprised if the simple linear search measures slower than the alternatives in some real code.

If you ensure the array is sorted, you can also try std::binary_search. I don't know if it will be faster or slower for a small number of values.

As always, measure to be sure.

like image 20
Neil Kirk Avatar answered Sep 23 '22 23:09

Neil Kirk