Coding in C++. I need a data structure for a bunch of sorted strings. I will insert all the strings into it in one go and never update it, but I will search it very often. All I need is to check whether a given string exists in the structure or not. I expect the list to hold about 100 strings. Which structure would be faster? I was thinking hashmap at first, but I read somewhere that for such a small number of elements, a binary search over a vector can work better (since the strings are sorted).
The best data structure for fast string lookup is the trie. A trie is a tree-like structure keyed on string prefixes (the name comes from "retrieval"); each path from the root spells out a stored string, and it can be visualized like a graph.
Linear search: the simplest search algorithm; it checks each item in the collection until it either matches the searched element or reaches the end of the data. A linear search is mainly useful when the data is unsorted.
Hash table: a hash table is a list of key/value pairs; you access a value directly by its key, so this structure is high-speed for lookups - generally faster than scanning an array.
The trie data structure is one alternative method for searching text that can be more efficient than these traditional approaches.
Assuming you are talking about "full sized" CPUs[1], a binary search through strings, even with only 100 elements, is likely to be quite slow, relative to other solutions at least. You may suffer several branch mispredicts for every search, and you will probably end up examining each character in the input string several times (since you need to repeatedly strcmp at each node in the binary search).
As someone already pointed out, the only real way to know is to measure - but to do that you still need to figure out what the candidates are in the first place! Furthermore, it's not always possible to measure in a realistic scenario, since you might not even know what that scenario is (imagine, for example, designing a library function that will be widely used across many different cases).
Finally, understanding what is likely to be fast lets you both eliminate candidates you know will perform badly, and allows you to double-check your test results with your intuition: if something is much slower than you expected, it's worth checking why (did the compiler do something stupid), and if something is much faster then maybe it's time to update your intuition.
So I'll try to actually take a stab at what's going to be fast - assuming speed really matters here, and you can spend some time validating a complex solution. As a baseline, a straightforward implementation will probably take 100 ns, and a really optimized one perhaps 10 ns. So if you spend 10 hours of engineering time on this, you'll have to call this function 400 billion times just to earn your 10 hours back[5]. When you factor in the risk of bugs, the maintenance complexity and other overheads, you'll want to make sure you are calling this function many trillions of times before trying to optimize it. Such functions are rare, but they certainly exist[4].
That said, you are missing a lot of information that would be needed to design a very fast solution, such as:

1. How many times will the lookup be called?
2. Is the set of strings known at compile time?
3. What fraction of the searches will be successful?
4. Can you tolerate a (controllable) rate of false positives?
5. How are the strings represented: std::string, const char *, or something else?

The answers to the above can help you partition the design space as described below.
If you can accept a (controllable) number of false positives[2] (per 4 above), or if most of your searches will be unsuccessful (per 3), then you should consider a Bloom filter. For example, you could use a 1024-bit (128-byte) filter, and use a 60-bit hash of the string to index into it as six 10-bit functions. This gives a < 1% false positive rate.
This has the advantage that, outside of the hash calculation, the cost is independent of the lengths of the strings, and it doesn't depend on the matching behavior (e.g., a search that relies on repeated string comparison will be slower if the strings tend to have long common prefixes).
If you can accept false positives, you are done - but if you need an always-correct answer while still expecting mostly unsuccessful searches, you can use the filter as a pre-check: if the Bloom filter returns false (the usual case), you are done; if it returns true, you double-check in one of the always-correct structures discussed below. So the common case is fast, but the right answer is always returned.
If the set of ~100 strings is known at compile time, or you are OK doing some one-time heavy work to pre-process the strings, you could consider a perfect hash. If you have a compile-time known search set, you can just slap the strings into gperf and it will spit out a hash function and lookup table.
For example, I just fed 100 random English words[3] into gperf and it generated a hash function that only needs to look at two characters to uniquely distinguish every word, like this:
static unsigned int hash (const char *str, unsigned int len)
{
  static unsigned char asso_values[] =
    {
      115, 115, 115, 115, 115,  81,  48,   1,  77,  72,
      115,  38,  81, 115, 115,   0,  73,  40,  44, 115,
       32, 115,  41,  14,   3, 115, 115,  30, 115, 115,
      115, 115, 115, 115, 115, 115, 115,  16,  18,   4,
       31,  55,  13,  74,  51,  44,  32,  20,   4,  28,
       45,   4,  19,  64,  34,   0,  21,   9,  40,  70,
       16,   0, 115, 115, 115, 115, 115, 115, 115, 115,
      /* most of the table omitted */
    };
  register int hval = len;

  switch (hval)
    {
      default:
        hval += asso_values[(unsigned char)str[3]+1];
        /*FALLTHROUGH*/
      case 3:
      case 2:
      case 1:
        hval += asso_values[(unsigned char)str[0]];
        break;
    }
  return hval;
}
Now your hash function is quick and probably well-predicted (if you don't have too many strings of length 3 or less). To look up a string, you then index into the hash table (also generated by gperf) and compare what you get to the input string.
Under some reasonable assumptions this is going to be about as fast as you can get - clang generates code like this:
in_word_set: # @in_word_set
push rbx
lea eax, [rsi - 3]
xor ebx, ebx
cmp eax, 19
ja .LBB0_7
lea ecx, [rsi - 1]
mov eax, 3
cmp ecx, 3
jb .LBB0_3
movzx eax, byte ptr [rdi + 3]
movzx eax, byte ptr [rax + hash.asso_values+1]
add eax, esi
.LBB0_3:
movzx ecx, byte ptr [rdi]
movzx edx, byte ptr [rcx + hash.asso_values]
cdqe
add rax, rdx
cmp eax, 114
ja .LBB0_6
mov rbx, qword ptr [8*rax + in_word_set.wordlist]
cmp cl, byte ptr [rbx]
jne .LBB0_6
add rdi, 1
lea rsi, [rbx + 1]
call strcmp
test eax, eax
je .LBB0_7
.LBB0_6:
xor ebx, ebx
.LBB0_7:
mov rax, rbx
pop rbx
ret
It's a ton of code, but with a reasonable amount of ILP. The critical path is through the 3 dependent memory accesses (look up the char value in str -> look up the hash value for that char in the hash function table -> look up the string in the actual hash table); you could expect this to take perhaps 20 cycles typically (plus the strcmp time, of course).
The "classic" compsci solution to this problem is the trie. The trie could be a reasonable approach for your problem, especially if many unsuccessful matches can be rejected quickly within the first few characters (this depends largely on the content of the match set and the strings you are checking).
You'd want a fast trie implementation to make this work. Overall, I feel this approach will be limited by the serially dependent memory accesses - each node is visited in a kind of pointer-chasing manner, so you'll suffer a lot from L1 access latency.
strcmp
Almost all of the above solutions depend on strcmp at some point - the exception being the Bloom filter, which allows false positives. So you want to make sure this part of your code is fast.
In particular, compilers may sometimes inline "builtin" versions of strcmp rather than calling the library function: in a quick test, icc did the inlining, but clang and gcc opted to call the library function. There is no simple rule for which one is going to be faster, but in general the library routines are often SIMD-optimized and may be faster for long strings, while the inlined versions avoid function call overhead and may be faster for short strings. You can test both approaches and mostly force the compilers to do what is faster in your case.
Even better, you may be able to take advantage of your control of the inputs to do much better - if you can ensure that, for example, the input strings are null-padded so that their length is a multiple of 8, then you can do the same for the reference strings in your hash table (or whatever other structure) and compare the strings 8 bytes at a time. This not only greatly speeds up the matching, it cuts back dramatically on branch mispredicts because it essentially quantizes the looping behavior (all strings of 1-8 characters loop once, and so on).
[1] Here I mean desktop, server, and laptop CPUs, or even modern smartphone CPUs, and not embedded-device MCUs or anything like that.
[2] Allowing false positives means that it's OK if your "is in set" check sometimes returns true even when the input string is not in the set. Note that it never gets it wrong the other way around: it always returns true when the string is in the set - there are no false negatives.
[3] Specifically, awk 'NR%990==0' /usr/share/dict/american-english > words.
[4] For example, how many times do you think strcmp has been called in the history of computing? How much time would be saved if it were even 1 ns faster?
[5] That's kind of equating CPU time with engineering time, which is probably off by a factor of more than 1000x: Amazon AWS charges something like $0.02 per hour of CPU time, while a good engineer can expect perhaps $50 per hour (in the first world). So by that (very rough!) metric, engineering time is 2500x more valuable than CPU time. So perhaps you need quadrillions of calls for 10 hours of work to pay off...
The best (and only) way to tell what structure is fastest for a certain situation is to actually benchmark/measure it with different data structures. Then pick the fastest.
Or in other words: Measuring your code gives you an advantage over those people who think they are too smart to measure. ;)
For rather small lists like the 100 elements you mentioned in your question, the choice of structure/algorithm makes little difference, because the time gained is probably negligible - unless the search is performed very often by your program.