This is purely out of curiosity. I was browsing through an article comparing various string search algorithms and noticed they were all designed to find the first matching substring. This got me thinking... What if I wanted to find all occurrences of a substring?
I'm sure I could create a loop that used a variant of KMP or BM and dumped each found occurrence into an array but this hardly seems like it would be the fastest.
Wouldn't a divide and conquer algorithm be superior?
For instance lets say your looking for the sequence "abc" in a string "abbcacabbcabcacbccbabc".
Considering the ease with which I came up with this idea I assume someone already came up with it and improved upon it 30 years ago.
See Suffix array
Applications
The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(mlogn) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + logn) search time.
If you are only processing a given string once, the suffix array is overkill. It takes O(n log n) time to create, so a KMP style algorithm will beat it. Furthermore, if your string is enormous, or you want to get results in real-time as you receive the string, the suffix array won't work.
It is certainly possible to modify the KMP algorithm to keep going after it finds a match without taking additional memory, aside from the memory used to store the matches (which may be unnecessary as well, if you are simply printing out the matches or processing them as you go along). As a start, take the Wikipedia implementation and modify the "return m" statement to "add m to a list of indexes". But you're not done yet. You also need to ask yourself, do you allow overlapping occurrences? For example, if your substring is "abab" and you are looking in the main string "abababab", are there two occurrences or three? In the example I gave ("as a start"), you could either reset i to 0 to give the "two" answer, or you could fall through to the "otherwise" case after the "add m" to give the "three" answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With