Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to select all strings from list starting from

I'm looking for the fastest way to find all strings in a collection starting from a set of characters. I can use sorted collection for this, however I can't find convenient way to do this in .net. Basically I need to find low and high indexes in a collection that meet the criteria.

BinarySearch on List<T> does not guarantee the returned index is that of the 1st element, so one would need to iterate up and down to find all matching strings which is not fast if one has a large list.

There are also Linq methods (with parallel), but I'm not sure which data structure will provide the best results.

List example, ~10M of records:

aaaaaaaaaaaaaaabb
aaaaaaaaaaaaaaba
aaaaaaaaaaaaabc
...
zzzzzzzzzzzzzxx
zzzzzzzzzzzzzyzzz
zzzzzzzzzzzzzzzzzza

Search for strings starting from: skk...

Result: record indexes from x to y.

UPDATE: strings can have different lengths and are unique.

like image 539
Easy Rider Avatar asked Feb 22 '12 20:02

Easy Rider


3 Answers

In terms of time complexity - you should use a trie, and not a sorted set or binary search.

Trie will get you a O(|S|) time complexity [while sorted set and binary search gets you O(|S|logn)] to find the node [let it be v] that represents that prefix.

All the strings [paths] in the trie that fit the prefix will "pass" via v. By adding numberOfLeaves field to each node, you can find out exactly how much leaves [=strings] this node has.

In a single pass - you can also find the index of this v [For each node u in the path from the root to v - sum numberOfLeaves for each sibling which is left to u].

This requires much more work then using already existing structures, but if the data is huge - it can make your algorithm much faster, so you should concider it if performance is an issue and you expect a huge set of strings.

like image 188
amit Avatar answered Sep 18 '22 23:09

amit


You can do it with a hand-written binary search - one which just doesn't stop when it's found a match; it continues until it's found a single index.

In fact, you don't even have to write the binary search bit yourself - you could create a custom comparer which never returns 0, i.e. if you're looking for "abc" then it treats "abb" as being below the target value, but "abc" as being above the target value. This way the BinarySearch will always return a negative number, which you can then just bit-flip to find the theoretical insertion point for "the string which comes between abb and abc".

You can do the same in reverse (treat "abc" as lower than the target value) to find the highest bound.

If you know the format of these strings and it won't have edge cases like Unicode NULL characters, and everything's the same length, you can even do it without writing your own comparer:

// This could be done more efficiently :)
string stringJustBelow = target.Substring(0, target.Length - 1) +
                         target[target.Length - 1] + "X";
string stringJustAbove = target + "X"; // Or any character

int lowerBoundInclusive = ~list.BinarySearch(stringJustBelow);
int upperBoundExclusive = ~list.BinarySearch(stringJustAbove);

So if you strings are all length 3 and you were searching for "abc" you'd actually look for where "abbX" and "abcX" would be inserted.

like image 35
Jon Skeet Avatar answered Sep 16 '22 23:09

Jon Skeet


Put them in SortedSet and use GetViewBetween.

This answer illustrates searching for both prefix and suffix, I'm sure you'll have no trouble adapting it to prefix-only search, if that is indeed what you want.

If you just want to search for a range (not prefix), directly using GetViewBetween should suffice.

like image 32
Branko Dimitrijevic Avatar answered Sep 20 '22 23:09

Branko Dimitrijevic