
Searching fast through a sorted list of strings in C++

Tags:

c++

visual-c++

I have a list of about a hundred unique strings in C++. I need to check whether a value exists in this list, preferably lightning fast.

I am currently using a hash_set with std::string keys (since I could not get it to work with const char*), like so:

stdext::hash_set<const std::string> _items;
_items.insert("LONG_NAME_A_WITH_SOMETHING");
_items.insert("LONG_NAME_A_WITH_SOMETHING_ELSE");
_items.insert("SHORTER_NAME");
_items.insert("SHORTER_NAME_SPECIAL");

stdext::hash_set<const std::string>::const_iterator it = _items.find( "SHORTER_NAME" );

if( it != _items.end() ) {
   std::cout << "item exists" << std::endl;
}

Does anybody have a good idea for a faster search method, short of building a complete hash table myself?


The list is fixed and will not change. It contains the names of elements that are affected by a certain bug and should be repaired on the fly when opened with a newer version.

I've built hash tables before using Aho-Corasick, but I'm not really willing to add too much complexity here.


I was amazed by the number of answers. I tested a few methods for performance and ended up using a combination of kirkus's and Rob K.'s answers. I had tried a binary search before, but I guess I had a small bug in my implementation (how hard can it be...).

The results were shocking... I thought I had a fast implementation using a hash_set... well, turns out I did not. Here are some statistics (and the eventual code):

Random lookup of 5 existing keys and 1 non-existent key, 50,000 times:

My original algorithm took on average 18.62 seconds.
A linear search took on average 2.49 seconds.
A binary search took on average 0.92 seconds.
A search using a perfect hash table generated by gperf took on average 0.51 seconds.

Here's the code I use now:

bool searchWithBinaryLookup(const std::string& strKey) {
   // Sorted, fixed list of items; NUM_ITEMS is the number of entries.
   static const char* const arrItems[NUM_ITEMS] = { /* sorted list of items */ };

   /* Binary lookup */
   int low, mid, high;

   low = 0;
   high = NUM_ITEMS;
   while( low < high ) {
      mid = (low + high) / 2;
      if(arrItems[mid] > strKey) {
         high = mid;
      }
      else if(arrItems[mid] < strKey) {
         low = mid + 1;
      }
      else {
         return true;
      }
   }

   return false;
}
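For comparison (this is not from the original post), here's a sketch of the same lookup written with std::lower_bound from <algorithm>; the list is shortened to the four names from the question just to keep the example self-contained:

#include <algorithm>
#include <cstddef>
#include <string>

// Comparator for std::lower_bound: element (const char*) < key (std::string).
static bool lessThanKey(const char* element, const std::string& key) {
   return element < key;   // uses operator<(const char*, const std::string&)
}

bool searchWithLowerBound(const std::string& strKey) {
   // Same fixed, sorted list idea as above.
   static const char* const arrItems[] = {
      "LONG_NAME_A_WITH_SOMETHING",
      "LONG_NAME_A_WITH_SOMETHING_ELSE",
      "SHORTER_NAME",
      "SHORTER_NAME_SPECIAL"
   };
   static const std::size_t NUM_ITEMS = sizeof(arrItems) / sizeof(arrItems[0]);

   const char* const* end = arrItems + NUM_ITEMS;
   const char* const* it  = std::lower_bound(arrItems, end, strKey, lessThanKey);

   // lower_bound returns the first element not less than the key;
   // a final equality check tells us whether the key is actually present.
   return it != end && strKey == *it;
}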

NOTE: This is Microsoft VC++, so I'm not using the hash_set from the SGI STL.


I did some tests this morning using gperf, as VardhanDotNet suggested, and it is indeed quite a bit faster.

Asked by Huppie on Jan 26 '09

3 Answers

If your list of strings is fixed at compile time, use gperf: http://www.gnu.org/software/gperf/

Quoting the gperf page: "gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only."

The output of gperf is not governed by the GPL or LGPL, AFAIK.
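As a rough illustration (not from the original answer), here's the general shape of using gperf's output from C++. The file name keys.gperf is made up for this example, and in_word_set is the default name gperf gives the generated lookup function; its exact signature depends on the gperf version and options used, so match the declaration to the generated code.

// Hypothetical workflow:
//   1. Put the fixed keys in a file, e.g. keys.gperf:
//        %%
//        LONG_NAME_A_WITH_SOMETHING
//        LONG_NAME_A_WITH_SOMETHING_ELSE
//        SHORTER_NAME
//        SHORTER_NAME_SPECIAL
//   2. Generate the lookup code:  gperf keys.gperf > keys_hash.cpp
//
// By default the generated in_word_set returns a non-null pointer when the
// string is one of the keys, and 0 otherwise.

#include <string>

// Declaration of the gperf-generated lookup; adjust it to the generated code.
const char* in_word_set(const char* str, unsigned int len);

bool searchWithPerfectHash(const std::string& strKey) {
   return in_word_set(strKey.c_str(), static_cast<unsigned int>(strKey.length())) != 0;
}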

Answered by vrdhn


You could try a PATRICIA Trie if none of the standard containers meet your needs.

Worst-case lookup is bounded by the length of the string you're looking up. Also, strings share common prefixes, so it is really easy on memory. If you have lots of relatively short strings, this could be beneficial.

Check it out here.

Note: PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric
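To illustrate the prefix-sharing idea, here is a very small sketch of a plain (uncompressed) trie; a real PATRICIA/radix trie additionally collapses chains of single-child nodes, so treat this only as a starting point, not as what the answer above proposes verbatim:

#include <map>
#include <string>

class Trie {
public:
   Trie() : root_(new Node) {}
   ~Trie() { destroy(root_); }

   void insert(const std::string& key) {
      Node* node = root_;
      for (std::string::size_type i = 0; i < key.size(); ++i) {
         Node*& child = node->children[key[i]];   // creates a null slot if missing
         if (child == 0) child = new Node;
         node = child;
      }
      node->isKey = true;
   }

   // Lookup cost is bounded by the length of the key, not the number of keys.
   bool contains(const std::string& key) const {
      const Node* node = root_;
      for (std::string::size_type i = 0; i < key.size(); ++i) {
         std::map<char, Node*>::const_iterator it = node->children.find(key[i]);
         if (it == node->children.end()) return false;
         node = it->second;
      }
      return node->isKey;
   }

private:
   struct Node {
      Node() : isKey(false) {}
      bool isKey;
      std::map<char, Node*> children;
   };

   static void destroy(Node* node) {
      for (std::map<char, Node*>::iterator it = node->children.begin();
           it != node->children.end(); ++it) {
         destroy(it->second);
      }
      delete node;
   }

   Node* root_;

   // Copying is not handled in this sketch.
   Trie(const Trie&);
   Trie& operator=(const Trie&);
};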

Answered by user21714


If it's a fixed list, sort the list and do a binary search. With only a hundred or so strings on a modern CPU, I can't imagine you're really going to see any appreciable difference between algorithms, unless your application does nothing but search this list 100% of the time.
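A minimal sketch (not part of the original answer) of that sort-then-binary-search recipe using the standard library; the item names are just the ones from the question, used to keep the example self-contained:

#include <algorithm>
#include <string>
#include <vector>

bool existsInSortedList(const std::string& strKey) {
   // Build and sort the fixed list once, on first use.
   static std::vector<std::string> items;
   if (items.empty()) {
      items.push_back("LONG_NAME_A_WITH_SOMETHING");
      items.push_back("LONG_NAME_A_WITH_SOMETHING_ELSE");
      items.push_back("SHORTER_NAME");
      items.push_back("SHORTER_NAME_SPECIAL");
      std::sort(items.begin(), items.end());
   }
   return std::binary_search(items.begin(), items.end(), strKey);
}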

Answered by kirkus