Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementation of string pattern matching using Suffix Array and LCP(-LR)

During the last weeks I tried to figure out how to efficiently find a string pattern within another string.

I found out that for a long time, the most efficient way would have been using a suffix tree. However, since this data structure is very expensive in space, I studied the use of suffix arrays further (which use far less space). Different papers such as "Suffix Arrays: A new method for on-line string searches" (Manber & Myers, 1993) state, that searching for a substring can be realised in O(P+log(N)) (where P is the length of the pattern and N is length of the string) by using binary search and suffix arrays along with LCP arrays.

I especially studied the latter paper to understand the search algorithm. This answer did a great job in helping me understand the algorithm (and incidentally made it into the LCP Wikipedia Page).

But I am still looking for an way to implement this algorithm. Especially the construction of the mentioned LCP-LR arrays seems very complicated.

References:

Manber & Myers, 1993: Manber, Udi ; Myers, Gene, SIAM Journal on Computing, 1993, Vol.22(5), pp.935-948, http://epubs.siam.org/doi/pdf/10.1137/0222058

UPDATE 1

Just to emphasize on what I am interested in: I understood LCP arrays and I found ways to implement them. However, the "plain" LCP array would not be appropriate for efficient pattern matching (as described in the reference). Thus I am interested in implementing LCP-LR arrays which seems much more complicated than just implementing an LCP array

UPDATE 2

Added link to referenced paper

like image 346
Paddre Avatar asked Jan 04 '15 17:01

Paddre


2 Answers

The termin that can help you: enchanced suffix array, which is used to describe suffix array with various other arrays in order to replace suffix tree (lcp, child).

These can be some of the examples:

https://code.google.com/p/esaxx/ ESAXX

http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA

The esaxx one seems to be doing what you want, plus, it has example enumSubstring.cpp how to use it.


If you take a look at the referenced paper, it mentions an useful property (4.2). Since SO does not support math, there is no point to copy it here.

I've done quick implementation, it uses segment tree:

// note that arrSize is O(n)
// int arrSize = 2 * 2 ^ (log(N) + 1) + 1; // start from 1

// LCP = new int[N];
// fill the LCP...
// LCP_LR = new int[arrSize];
// memset(LCP_LR, maxValueOfInteger, arrSize);
// 

// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] == [1..N]
// LCP_LR[2] == [1..N/2]
// LCP_LR[3] == [N/2+1 .. N]

// rangeI = LCP_LR[i]
//   rangeILeft  = LCP_LR[2 * i]
//   rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
    if(low == high)
    {
        LCP_LR[index] = LCP[low];
        return;
    }

    int mid = (low + high) / 2;

    buildLCP_LR(2*index, low, mid);
    buildLCP_LR(2*index+1, mid + 1, high);

    LCP_LR[index] = min(LCP_LR[2*index], LCP_LR[2*index + 1]);
}
like image 183
Erti-Chris Eelmaa Avatar answered Oct 11 '22 20:10

Erti-Chris Eelmaa


Here is a fairly simple implementation in C++, though the build() procedure builds the suffix array in O(N lg^2 N) time. The lcp_compute() procedure has linear complexity. I have used this code in many programming contests, and it has never let me down :)

#include <stdio.h>
#include <string.h>
#include <algorithm>

using namespace std;

const int MAX = 200005;
char str[MAX];
int N, h, sa[MAX], pos[MAX], tmp[MAX], lcp[MAX];

bool compare(int i, int j) {
  if(pos[i] != pos[j]) return pos[i] < pos[j]; // compare by the first h chars
  i += h, j += h; // if prefvious comparing failed, use 2*h chars
  return (i < N && j < N) ? pos[i] < pos[j] : i > j; // return results
}

void build() {
  N = strlen(str);
  for(int i=0; i<N; ++i) sa[i] = i, pos[i] = str[i]; // initialize variables
  for(h=1;;h<<=1) {
    sort(sa, sa+N, compare); // sort suffixes
    for(int i=0; i<N-1; ++i) tmp[i+1] = tmp[i] + compare(sa[i], sa[i+1]); // bucket suffixes
    for(int i=0; i<N; ++i) pos[sa[i]] = tmp[i]; // update pos (reverse mapping of suffix array)
    if(tmp[N-1] == N-1) break; // check if done
  }
}

void lcp_compute() {
  for(int i=0, k=0; i<N; ++i)
    if(pos[i] != N-1) {
      for(int j=sa[pos[i]+1]; str[i+k] == str[j+k];) k++;
      lcp[pos[i]] = k;
      if(k) k--;
    }
}

int main() {
  scanf("%s", str);
  build();
  for(int i=0; i<N; ++i) printf("%d\n", sa[i]);
  return 0;
}

Note: If you want the complexity of the build() procedure to become O(N lg N), you can replace the STL sort with radix sort, but this is going to complicate the code.

Edit: Sorry, I misunderstood your question. Although i haven't implemented string matching with suffix array, I think I can describe you a simple non-standard, but fairly efficient algorithm for string matching. You are given two strings, the text, and the pattern. Given these string you create a new one, lets call it concat, which is the concatenation of the two given strings (first the text, then the pattern). You run the suffix array construction algorithm on concat, and you build the normal lcp array. Then, you search for a suffix of length pattern.size() in the suffix array you just built. Lets call its position in the suffix array pos. You then need two pointers lo and hi. At start lo = hi = pos. You decrease lo while lcp(lo, pos) = pattern.size() and you increase hi while lcp(hi, pos) = pattern.size(). Then you search for a suffix of length at least 2*pattern.size() in the range [lo, hi]. If you find it, you found a match. Otherwise, no match exists.

Edit[2]: I will be back with an implementation as soon as I have one...

Edit[3]:

Here it is:

// It works assuming you have builded the concatenated string and
// computed the suffix and the lcp arrays
// text.length() ---> tlen
// pattern.length() ---> plen
// concatenated string: str

bool match(int tlen, int plen) {
  int total = tlen + plen;
  int pos = -1;
  for(int i=0; i<total; ++i)
    if(total-sa[i] == plen)
      { pos = i; break; }
  if(pos == -1) return false; 
  int lo, hi;
  lo = hi = pos;
  while(lo-1 >= 0 && lcp[lo-1] >= plen) lo--;
  while(hi+1 <  N && lcp[hi] >= plen) hi++;
  for(int i=lo; i<=hi; ++i)
    if(total-sa[i] >= 2*plen)
      return true;
  return false;
}
like image 35
Rontogiannis Aristofanis Avatar answered Oct 11 '22 20:10

Rontogiannis Aristofanis