Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to find all occurrences of a substring?

This is purely out of curiosity. I was browsing through an article comparing various string search algorithms and noticed they were all designed to find the first matching substring. This got me thinking... What if I wanted to find all occurrences of a substring?

I'm sure I could create a loop that used a variant of KMP or BM and dumped each found occurrence into an array but this hardly seems like it would be the fastest.

Wouldn't a divide and conquer algorithm be superior?

For instance lets say your looking for the sequence "abc" in a string "abbcacabbcabcacbccbabc".

  1. On the first pass find all occurrences of the first character and store their positions.
  2. On each additional pass use the positions from the preceding pass to find all occurrences of next character, reducing the candidates for the next pass with each iteration.

Considering the ease with which I came up with this idea I assume someone already came up with it and improved upon it 30 years ago.

like image 864
Kenneth Cochran Avatar asked Oct 12 '09 20:10

Kenneth Cochran


2 Answers

See Suffix array

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(mlogn) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + logn) search time.

like image 83
Nick Dandoulakis Avatar answered Oct 02 '22 02:10

Nick Dandoulakis


If you are only processing a given string once, the suffix array is overkill. It takes O(n log n) time to create, so a KMP style algorithm will beat it. Furthermore, if your string is enormous, or you want to get results in real-time as you receive the string, the suffix array won't work.

It is certainly possible to modify the KMP algorithm to keep going after it finds a match without taking additional memory, aside from the memory used to store the matches (which may be unnecessary as well, if you are simply printing out the matches or processing them as you go along). As a start, take the Wikipedia implementation and modify the "return m" statement to "add m to a list of indexes". But you're not done yet. You also need to ask yourself, do you allow overlapping occurrences? For example, if your substring is "abab" and you are looking in the main string "abababab", are there two occurrences or three? In the example I gave ("as a start"), you could either reset i to 0 to give the "two" answer, or you could fall through to the "otherwise" case after the "add m" to give the "three" answer.

like image 36
Martin Hock Avatar answered Oct 02 '22 03:10

Martin Hock