
C++ fastest data structure for multiple searches

Coding in C++. I need a data structure for a bunch of sorted strings. I will be inserting all the strings into it in one go and not updating it afterwards, but I will be searching for strings very often. All I need is to see whether a given string exists in the structure or not. I am expecting the list to be about 100 strings. What would be the fastest structure? I was thinking hash map at first, but I saw somewhere that for such a small number of elements, a binary search over a vector would work better (since they're sorted).

asked Nov 16 '16 by Kavish Munjal


2 Answers

Assuming you are talking about "full sized" CPUs[1], a binary search through strings, even with only 100 elements, is likely to be quite slow, relative to other solutions at least. You may suffer several branch mispredicts for every search, and will probably end up examining each character in the input string several times (since you need to repeatedly strcmp at each node of the binary search).
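
For reference, the baseline being discussed is essentially this (a minimal sketch of the sorted-vector approach from the question; the names are illustrative):

#include <algorithm>
#include <string>
#include <vector>

// The baseline from the question: a sorted std::vector<std::string> searched
// with std::binary_search. Each probe is a full string comparison, so the
// input's characters may be examined O(log n) times per lookup.
bool contains(const std::vector<std::string>& sorted_words, const std::string& s) {
    return std::binary_search(sorted_words.begin(), sorted_words.end(), s);
}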

As someone already pointed out, the only real way to know is to measure - but to do that you still need to be able to figure out what the candidates are in the first place! Furthermore, it's not always possible to measure in a realistic scenario, since you might not even know what that scenario is (imagine, for example, designing a library function which will be widely used across many different cases).

Finally, understanding what is likely to be fast lets you both eliminate candidates you know will perform badly, and double-check your test results against your intuition: if something is much slower than you expected, it's worth checking why (did the compiler do something stupid?), and if something is much faster, then maybe it's time to update your intuition.

So I'll try to actually take a stab at what's going to be fast - assuming speed really matters here, and you can spend some time validating a complex solution. As a baseline, a straightforward implementation will probably take 100 ns, and a really optimized one perhaps 10 ns. So if you spend 10 hours of engineering time on this, you'll have to call this function 400 billion times just to earn your 10 hours back[5] (10 hours is about 3.6 × 10^13 ns; divided by the ~90 ns saved per call, that's roughly 4 × 10^11 calls). When you factor in the risk of bugs, the maintenance complexity and other overheads, you are going to want to make sure you are calling this function many trillions of times before trying to optimize it. Such functions are rare, but they certainly exist[4].

That said, you are missing a lot of information that would be needed to help design a very fast solution, such as:

  1. Is your input to the search function a std::string or const char * or something else?
  2. What is the average and maximum string length?
  3. Will most of your searches be successful or unsuccessful?
  4. Can you accept some false positives?
  5. Is the set of strings known at compile time, or are you OK with a long initialization phase?

The answers to the above can help you partition the design space, as described below.

Bloom Filters

If per (4) you can accept a (controllable) number of false positives[2], or per (3) most of your searches will be unsuccessful, then you should consider a Bloom filter. For example, you could use a 1024-bit (128-byte) filter, and use a 60-bit hash of the string to index into it with six 10-bit functions. This gives a < 1% false positive rate.

This has the advantage that outside of the hash calculation it's independent of the lengths of the strings, and it doesn't depend on the matching behavior (e.g., a search that relies on repeated string comparison will be slower if the strings tend to have long common prefixes).

If you can accept false positives, you are done - but if you need the answer to always be correct yet expect mostly unsuccessful searches, you use it as a filter: if the Bloom filter returns false (the usual case) you are done, but if it returns true, you double-check in one of the always-correct structures discussed below. So the common case is fast, but the right answer is always returned.
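
As a rough illustration of that layout (a minimal sketch, not tuned code; std::hash is used here only as a stand-in for a real 64-bit string hash, and all names are made up):

#include <bitset>
#include <cstdint>
#include <functional>
#include <string>

// Illustrative 1024-bit Bloom filter with six 10-bit index functions, as
// described above. std::hash is only a stand-in for a real 64-bit string hash.
class SmallBloomFilter {
    std::bitset<1024> bits_;

    // Split one 64-bit hash into six 10-bit indices (60 bits used in total).
    static void indices(const std::string& s, unsigned out[6]) {
        std::uint64_t h = std::hash<std::string>{}(s);
        for (int i = 0; i < 6; ++i) {
            out[i] = static_cast<unsigned>(h & 0x3FF);  // low 10 bits -> 0..1023
            h >>= 10;
        }
    }

public:
    void insert(const std::string& s) {
        unsigned idx[6];
        indices(s, idx);
        for (unsigned i : idx) bits_.set(i);
    }

    // May return true for a string that was never inserted (false positive),
    // but never returns false for a string that was inserted.
    bool maybe_contains(const std::string& s) const {
        unsigned idx[6];
        indices(s, idx);
        for (unsigned i : idx) if (!bits_.test(i)) return false;
        return true;
    }
};

In the filter-plus-exact-check arrangement, a false from maybe_contains ends the search immediately, while a true falls through to one of the exact structures below.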

Perfect Hash

If the set of ~100 strings is known at compile time, or you are OK doing some one-time heavy work to pre-process the strings, you could consider a perfect hash. If you have a compile-time known search set, you can just slap the strings into gperf and it will spit out a hash function and lookup table.

For example, I just fed 100 random English words[3] into gperf and it generated a hash function that only needs to look at two characters to uniquely distinguish every word, like this:

static unsigned int hash (const char *str, unsigned int len)
{
  static unsigned char asso_values[] =
    {
      115, 115, 115, 115, 115,  81,  48,   1,  77,  72,
      115,  38,  81, 115, 115,   0,  73,  40,  44, 115,
       32, 115,  41,  14,   3, 115, 115,  30, 115, 115,
      115, 115, 115, 115, 115, 115, 115,  16,  18,   4,
       31,  55,  13,  74,  51,  44,  32,  20,   4,  28,
       45,   4,  19,  64,  34,   0,  21,   9,  40,  70,
       16,   0, 115, 115, 115, 115, 115, 115, 115, 115,
      /* most of the table omitted */
    };
  register int hval = len;

  switch (hval)
    {
      default:
        hval += asso_values[(unsigned char)str[3]+1];
      /*FALLTHROUGH*/
      case 3:
      case 2:
      case 1:
        hval += asso_values[(unsigned char)str[0]];
        break;
    }
  return hval;
}

Now your hash function is quick and probably well-predicted (if you don't have too many strings of length 3 or less). To look up a string, you then just index into the hash table (also generated by gperf) and compare what you get to the input string.
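
For illustration, a lookup wrapper might look like this (gperf also emits an in_word_set function alongside hash; the exact signature depends on the gperf version and options, so treat this declaration as an assumption):

#include <string>

// in_word_set is generated by gperf alongside hash(); it returns a pointer to
// the matching table entry, or a null pointer if the string is not in the set.
// (Signature shown for illustration - newer gperf versions use size_t for len.)
extern "C" const char* in_word_set(const char* str, unsigned int len);

bool contains(const std::string& s) {
    return in_word_set(s.c_str(), static_cast<unsigned int>(s.size())) != nullptr;
}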

Under some reasonable assumptions this is going to be about as fast as you can get - clang generates code like this:

in_word_set:                            # @in_word_set
        push    rbx
        lea     eax, [rsi - 3]
        xor     ebx, ebx
        cmp     eax, 19
        ja      .LBB0_7
        lea     ecx, [rsi - 1]
        mov     eax, 3
        cmp     ecx, 3
        jb      .LBB0_3
        movzx   eax, byte ptr [rdi + 3]
        movzx   eax, byte ptr [rax + hash.asso_values+1]
        add     eax, esi
.LBB0_3:
        movzx   ecx, byte ptr [rdi]
        movzx   edx, byte ptr [rcx + hash.asso_values]
        cdqe
        add     rax, rdx
        cmp     eax, 114
        ja      .LBB0_6
        mov     rbx, qword ptr [8*rax + in_word_set.wordlist]
        cmp     cl, byte ptr [rbx]
        jne     .LBB0_6
        add     rdi, 1
        lea     rsi, [rbx + 1]
        call    strcmp
        test    eax, eax
        je      .LBB0_7
.LBB0_6:
        xor     ebx, ebx
.LBB0_7:
        mov     rax, rbx
        pop     rbx
        ret

It's a ton of code, but with a reasonable amount of ILP. The critical path is through the 3 dependent memory accesses (look up the char value in str -> look up the hash value for that char in the hash function table -> look up the string in the actual hash table); you could expect this to take perhaps 20 cycles typically (plus the strcmp time, of course).

Trie

The "classic" compsci solution to this problem is the trie. The trie could be a reasonable approach for your problem, especially many unsuccessful matches can be rejected quickly within the first few characters (this depends on largely on the content of the match set and the strings you are checking).

You'd want a fast trie implementation to make this work. Overall, I feel this approach will be limited by the serially dependent memory accesses - each node is likely visited in a kind of pointer-chasing approach, so you'll suffer a lot from L1 access latency.
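
For illustration, a minimal trie sketch (one child slot per byte value, nothing optimized; a serious implementation would flatten the nodes into a contiguous array to reduce the pointer chasing mentioned above):

#include <array>
#include <memory>
#include <string>

// Minimal byte-wise trie: each node has one child slot per byte value and a
// flag marking the end of an inserted word. Lookups can reject a non-member
// as soon as a character has no child, but every step is a dependent load.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child;
    bool is_word = false;
};

class Trie {
    TrieNode root_;
public:
    void insert(const std::string& s) {
        TrieNode* n = &root_;
        for (unsigned char c : s) {
            if (!n->child[c]) n->child[c] = std::make_unique<TrieNode>();
            n = n->child[c].get();
        }
        n->is_word = true;
    }

    bool contains(const std::string& s) const {
        const TrieNode* n = &root_;
        for (unsigned char c : s) {
            n = n->child[c].get();
            if (!n) return false;
        }
        return n->is_word;
    }
};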

Optimizing strcmp

Almost all of the above solutions depend on strcmp at some point - the exception being the Bloom filter, which allows false positives. So you want to make sure this part of your code is fast.

In particular, compilers may sometimes inline "builtin" versions of strcmp rather than calling the library function: in a quick test, icc did the inlining, but clang and gcc opted to call the library function. There is no simple rule for which is going to be faster, but in general the library routines are often SIMD-optimized and may be faster for long strings, while the inlined versions avoid function call overhead and may be faster for short strings. You can test both approaches and mostly force the compilers to do whichever is faster in your case.
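
For example, one way to compare the two variants (a sketch; -fno-builtin-strcmp is the GCC/Clang flag that disables the builtin treatment, and the volatile-pointer trick simply forces a genuine library call):

#include <cstring>

// Direct call: the compiler may treat strcmp as a builtin and expand it inline
// (in the quick test above, icc did; gcc and clang called the library routine).
// Compiling with -fno-builtin-strcmp (GCC/Clang) forces the library call.
bool eq_direct(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Calling through a volatile function pointer is a blunt way to guarantee a
// genuine call into the (often SIMD-optimized) library routine, whatever the
// compiler would otherwise have inlined.
bool eq_library_call(const char* a, const char* b) {
    static int (*volatile fp)(const char*, const char*) = &std::strcmp;
    return fp(a, b) == 0;
}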

Even better, you may be able to take advantage of your control of the inputs to do much better - if you can ensure that, for example, the input strings are null-padded so that their length is a multiple of 8, then you can do the same for the reference strings in your hash table (or whatever other structure) and compare the strings 8 bytes at a time. This not only greatly speeds up the matching, it also cuts back dramatically on branch mispredicts because it essentially quantizes the looping behavior (all strings of 1-8 characters loop once, and so on).
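
A sketch of that 8-bytes-at-a-time idea, assuming the caller guarantees both buffers are NUL-padded out to a multiple of 8 bytes (the function and its name are illustrative):

#include <cstdint>
#include <cstring>

// Compare two buffers 8 bytes at a time. Both buffers must be NUL-padded so
// that len8 (the padded length, a multiple of 8) bytes are readable in each.
// memcpy into a uint64_t sidesteps alignment and strict-aliasing issues and
// compiles down to a single 8-byte load on mainstream compilers.
bool equal8(const char* a, const char* b, std::size_t len8) {
    for (std::size_t i = 0; i < len8; i += 8) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        if (wa != wb) return false;
    }
    return true;
}

For a ~100-entry table you would store the padded length next to each reference string, so len8 is known before the comparison starts.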


[1] Here I mean desktop, server, laptop CPUs, or even modern smartphone CPUs, and not embedded device MCUs or anything like that.

[2] Allowing false positives means that it's OK if your "is in set" check sometimes returns true even when the input string is not in the set. Note that it never gets it wrong the other way around: it always returns true when the string is in the set - there are no false negatives.

[3] Specifically, awk 'NR%990==0' /usr/share/dict/american-english > words.

[4] For example, how many times do you think strcmp has been called in the history of computing? How much time would be saved if it were even 1 ns faster?

[5] That's kind of equating CPU time with engineering time, which is probably off by a factor of more than 1000x: Amazon AWS charges something like $0.02 per hour of CPU time, while a good engineer might expect perhaps $50 per hour (in the first world). So by that (very rough!) metric, engineering time is 2500x more valuable than CPU time, and perhaps you need quadrillions of calls for 10 hours of work to pay off...

answered Oct 31 '22 by BeeOnRope


The best (and only) way to tell which structure is fastest for a certain situation is to actually benchmark/measure it with different data structures. Then pick the fastest.

Or in other words: Measuring your code gives you an advantage over those people who think they are too smart to measure. ;)

For rather small lists like the 100 elements you mention in your question, it does not make much of a difference which structure/algorithm you use, because the time gained is probably negligible - unless that search is performed very often by your program.
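
For completeness, a rough skeleton of such a measurement (illustrative only - a serious benchmark would randomize the lookups, include misses, and guard more carefully against the optimizer deleting the work):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    // Stand-in word list; in a real test this would be your ~100 strings.
    std::vector<std::string> words = {"alpha", "beta", "delta", "gamma"};
    std::sort(words.begin(), words.end());
    std::unordered_set<std::string> hashed(words.begin(), words.end());

    const int iters = 1000000;
    volatile bool sink = false;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        sink = std::binary_search(words.begin(), words.end(), words[i % words.size()]);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        sink = hashed.count(words[i % words.size()]) != 0;
    auto t2 = std::chrono::steady_clock::now();

    auto ns_per_lookup = [&](std::chrono::steady_clock::duration d) {
        return std::chrono::duration<double, std::nano>(d).count() / iters;
    };
    std::printf("sorted vector + binary_search: %.1f ns/lookup\n", ns_per_lookup(t1 - t0));
    std::printf("unordered_set:                 %.1f ns/lookup\n", ns_per_lookup(t2 - t1));
    (void)sink;
}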

answered Oct 31 '22 by Striezel