During the last weeks I tried to figure out how to efficiently find a string pattern within another string.
I found out that for a long time, the most efficient way would have been using a suffix tree. However, since this data structure is very expensive in space, I studied the use of suffix arrays further (which use far less space). Different papers such as "Suffix Arrays: A new method for on-line string searches" (Manber & Myers, 1993) state, that searching for a substring can be realised in O(P+log(N)) (where P is the length of the pattern and N is length of the string) by using binary search and suffix arrays along with LCP arrays.
I especially studied the latter paper to understand the search algorithm. This answer did a great job in helping me understand the algorithm (and incidentally made it into the LCP Wikipedia Page).
But I am still looking for an way to implement this algorithm. Especially the construction of the mentioned LCP-LR arrays seems very complicated.
References:
Manber & Myers, 1993: Manber, Udi ; Myers, Gene, SIAM Journal on Computing, 1993, Vol.22(5), pp.935-948, http://epubs.siam.org/doi/pdf/10.1137/0222058
UPDATE 1
Just to emphasize on what I am interested in: I understood LCP arrays and I found ways to implement them. However, the "plain" LCP array would not be appropriate for efficient pattern matching (as described in the reference). Thus I am interested in implementing LCP-LR arrays which seems much more complicated than just implementing an LCP array
UPDATE 2
Added link to referenced paper
The termin that can help you: enchanced suffix array
, which is used to describe suffix array with various other arrays in order to replace suffix tree (lcp, child).
These can be some of the examples:
https://code.google.com/p/esaxx/ ESAXX
http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA
The esaxx one seems to be doing what you want, plus, it has example enumSubstring.cpp how to use it.
If you take a look at the referenced paper, it mentions an useful property (4.2)
. Since SO does not support math, there is no point to copy it here.
I've done quick implementation, it uses segment tree:
// note that arrSize is O(n)
// int arrSize = 2 * 2 ^ (log(N) + 1) + 1; // start from 1
// LCP = new int[N];
// fill the LCP...
// LCP_LR = new int[arrSize];
// memset(LCP_LR, maxValueOfInteger, arrSize);
//
// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] == [1..N]
// LCP_LR[2] == [1..N/2]
// LCP_LR[3] == [N/2+1 .. N]
// rangeI = LCP_LR[i]
// rangeILeft = LCP_LR[2 * i]
// rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
if(low == high)
{
LCP_LR[index] = LCP[low];
return;
}
int mid = (low + high) / 2;
buildLCP_LR(2*index, low, mid);
buildLCP_LR(2*index+1, mid + 1, high);
LCP_LR[index] = min(LCP_LR[2*index], LCP_LR[2*index + 1]);
}
Here is a fairly simple implementation in C++, though the build()
procedure builds the suffix array in O(N lg^2 N)
time. The lcp_compute()
procedure has linear complexity. I have used this code in many programming contests, and it has never let me down :)
#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;
const int MAX = 200005;
char str[MAX];
int N, h, sa[MAX], pos[MAX], tmp[MAX], lcp[MAX];
bool compare(int i, int j) {
if(pos[i] != pos[j]) return pos[i] < pos[j]; // compare by the first h chars
i += h, j += h; // if prefvious comparing failed, use 2*h chars
return (i < N && j < N) ? pos[i] < pos[j] : i > j; // return results
}
void build() {
N = strlen(str);
for(int i=0; i<N; ++i) sa[i] = i, pos[i] = str[i]; // initialize variables
for(h=1;;h<<=1) {
sort(sa, sa+N, compare); // sort suffixes
for(int i=0; i<N-1; ++i) tmp[i+1] = tmp[i] + compare(sa[i], sa[i+1]); // bucket suffixes
for(int i=0; i<N; ++i) pos[sa[i]] = tmp[i]; // update pos (reverse mapping of suffix array)
if(tmp[N-1] == N-1) break; // check if done
}
}
void lcp_compute() {
for(int i=0, k=0; i<N; ++i)
if(pos[i] != N-1) {
for(int j=sa[pos[i]+1]; str[i+k] == str[j+k];) k++;
lcp[pos[i]] = k;
if(k) k--;
}
}
int main() {
scanf("%s", str);
build();
for(int i=0; i<N; ++i) printf("%d\n", sa[i]);
return 0;
}
Note: If you want the complexity of the build()
procedure to become O(N lg N)
, you can replace the STL sort with radix sort, but this is going to complicate the code.
Edit: Sorry, I misunderstood your question. Although i haven't implemented string matching with suffix array, I think I can describe you a simple non-standard, but fairly efficient algorithm for string matching. You are given two strings, the text
, and the pattern
. Given these string you create a new one, lets call it concat
, which is the concatenation of the two given strings (first the text
, then the pattern
). You run the suffix array construction algorithm on concat
, and you build the normal lcp array. Then, you search for a suffix of length pattern.size()
in the suffix array you just built. Lets call its position in the suffix array pos
. You then need two pointers lo
and hi
. At start lo = hi = pos
. You decrease lo
while lcp(lo, pos) = pattern.size()
and you increase hi
while lcp(hi, pos) = pattern.size()
. Then you search for a suffix of length at least 2*pattern.size()
in the range [lo, hi]
. If you find it, you found a match. Otherwise, no match exists.
Edit[2]: I will be back with an implementation as soon as I have one...
Edit[3]:
Here it is:
// It works assuming you have builded the concatenated string and
// computed the suffix and the lcp arrays
// text.length() ---> tlen
// pattern.length() ---> plen
// concatenated string: str
bool match(int tlen, int plen) {
int total = tlen + plen;
int pos = -1;
for(int i=0; i<total; ++i)
if(total-sa[i] == plen)
{ pos = i; break; }
if(pos == -1) return false;
int lo, hi;
lo = hi = pos;
while(lo-1 >= 0 && lcp[lo-1] >= plen) lo--;
while(hi+1 < N && lcp[hi] >= plen) hi++;
for(int i=lo; i<=hi; ++i)
if(total-sa[i] >= 2*plen)
return true;
return false;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With