Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Kasai Algorithm for Constructing LCP-Array Practical Example

I am attempting to complete the Algorithm's on Strings course on Coursera and am stuck on the method to construct an LCP array described in this video:

https://www.coursera.org/learn/algorithms-on-strings/lecture/HyUlH/computing-the-lcp-array

I am having difficulty understanding the theory being presented in this video. From my own research (Googling) I figure what they are describing is Kasai's algorithm. But much like the video all the explanations use very abstract explanations or large code samplets. Without understanding the theory I find the code examples hard to understand. I'm simply trying to find an explanation using a real world example.

ie: S=ababaa$ what are the steps using Kasai's algorithm that results in the final LCP array.

like image 213
roadrunner007 Avatar asked Sep 01 '19 19:09

roadrunner007


People also ask

What is Kasai algorithm?

Kasai's Algorithm is used to get the Longest Common Prefix (LCP) array from suffix array. At first suffix arrays are found. After that Kasai's algorithm takes the suffix array list to find LCP. For the LCP array, it takes O(m log n), where m is pattern length and n is the length of the text.

What is LCP in suffix array?

In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

What is DC3 algorithm?

The algorithm DC3 sorts suffixes with starting positions in a difference cover sample modulo 3 and then uses these to sort all suffixes. In this section, we present a generalized algorithm DC that can use any difference cover D modulo v. Step 0: Construct a sample. For k ∈ [0,v), define Bk = {i ∈ [0,n] | i mod v = k}.

How do you make a suffix array in linear time?

The even array is constructed from in linear time as follows. Sort array : An even suffix is one symbol followed by an odd suffix. For example, of T is followed by of T. We make tuples for even suffixes: the first element of a tuple is T [ 2 i ] and the second element is suffix S 2 i + 1 of T.


1 Answers

Preliminaries

I am using the algorithm by Kasai et al. as defined in this paper in Figure 3. Note the paper has a typo where the j occurring out of nowhere should have been a k. I changed the indexing to go from 0 to n-1 instead of 1 to n, to be more consistent with common programming languages. I also renamed the input string to S and Height to LCP:

Algorithm GetLCP:
input: A text S and its suffix array Pos
 1  for i:=0 to n-1 do
 2     Rank[Pos[i]] := i
 3  od
 4  h:=0
 5  for i:=0 to n-1 do
 6    if Rank[i] > 0 then
 7      k := Pos[Rank[i]-1]
 8      while S[i+h] = S[k+h] do
 9        h := h+1
10      od
11      LCP[Rank[i]] := h
12      if h>0 then h := h-1 fi
13    fi
14  od

Applying the algorithm

The algorithm takes as input a string S and its suffix array (which has to be computed by some other algorithm). So let us first compute the suffix array of your example string S=ababaa$. Taking all possible suffices of S and sorting them lexicographically gives us (note the empty suffix $ has been omitted, since it is not very interesting, as it is always first in the suffix array and has no common prefix with the next suffix; in the following, when I talk about all suffices, I mean all except the empty suffix):

| Suffix  | Starts at |
-----------------------
| a$      | 5         |
| aa$     | 4         |
| abaa$   | 2         |
| ababaa$ | 0         |
| baa$    | 3         |
| babaa$  | 1         |
-----------------------

In the pseudo-code from the paper, the suffix array is called Pos, holding the starting position of each suffix. So in our example, we would have Pos = [5,4,2,0,3,1]. If we would query e.g. Pos[0] we would get 5, which tells us that the lexicographically smallest suffix a$ is starting at position 5 in S.

The Rank array is intended to be a reverse-lookup for Pos. The rank of a suffix is its position in the suffix array, i.e. its rank in the lexicographic ordering of the suffices. So if Rank[i] = j, then the suffix starting at the ith position of S is located at position j in the suffix array, so Pos[j] = i. As can be seen in the pseudo-code, the Rank array can be computed by setting Rank[Pos[i]] = i.

| Starts At | Rank    |
-----------------------
| 0         | 3       |
| 1         | 5       |
| 2         | 2       |
| 3         | 4       |
| 4         | 1       |
| 5         | 0       |
-----------------------

So we have Rank = [3, 5, 2, 4, 1, 0]. We can see that for example the suffix babaa$, which starts at position 1 in S, has Rank[1] = 5, hence the suffix can be found at position 5 in the suffix array. In reverse, Pos[5] = 1, because the suffix at position 5 in the suffix array starts at position 1 in S.

The loop in the pseudo-code now iterates over the positions in S:

  • i=0: We are looking at the suffix starting at position 0 in S, which is ababaa$. We have Rank[0] = 3, which is > 0, so we know that the suffix has a preceding suffix in the suffix array.

    We want to determine the starting position of this preceding suffix. The preceding suffix is at position Rank[0]-1=2 in the suffix array. We have Pos[2] = 2. Therefore the suffix preceding ababaa$ in the suffix array is abaa$, starting at position 2 in S. We set k to be this starting position, so k=2.

    We determine the longest common prefix of the prefixes starting at i=0 and k=2. We currently have no information about how much they have in common (h=0), so we compare both suffices from the starting position:

    ababaa$
    abaa$
    

    We see that the length of the longest common prefix is 3. So we set h=0+3 and LCP[Rank[0]] = LCP[3] = 3. We then decrement h by one, so h=3-1=2.

  • i=1: We are looking at the suffix starting at position 1 in S, which is babaa$. We have Rank[1] = 5, which is > 0, so we know that the suffix has a preceding suffix in the suffix array.

    We want to determine the starting position of this preceding suffix. The preceding suffix is at position Rank[1]-1=4 in the suffix array. We have Pos[4] = 3. Therefore the suffix preceding babaa$ in the suffix array is baa$, starting at position 3 in S. We set k to be this starting position, so k=3.

    We determine the longest common prefix of the suffices starting at i=1 and k=3. From the previous iteration, we have h=2, so we already know that the first two characters of the suffices match, so we can start comparing from the third character onward:

    ba|baa$
    ba|a$
    

    We immediately have a mismatch, so no change to h=2 is done. We set LCP[Rank[1]] = LCP[5] = 2. We decrement h by one, so h now is h=2-1=1.

  • i=2: We are looking at the suffix starting at position 2 in S, which is abaa$. We have Rank[2] = 2, which is > 0, so we know that the suffix has a preceding suffix in the suffix array.

    We want to determine the starting position of this preceding suffix. The preceding suffix is at position Rank[2]-1=1 in the suffix array. We have Pos[1] = 4. Therefore the suffix preceding abaa$ in the suffix array is aa$, starting at position 4 in S. We set k to be this starting position, so k=4.

    We determine the longest common prefix of the suffices starting at i=2 and k=4. From the previous iteration, we have h=1, so we already know that the first character of the suffices matches, so we can start comparing from the second character onward:

     a|baa$
     a|a$
    

    We immediately get a mismatch, so no change to h=1 is done. We set LCP[Rank[2]] = LCP[2] = 1. We decrement h by one, so h now is h=1-1=0.

  • i=3: We are looking at the suffix starting at position 3 in S, which is baa$. We have Rank[3] = 4, which is > 0, so we know that the suffix has a preceding suffix in the suffix array.

    We want to determine the starting position of this preceding suffix. The preceding suffix is at position Rank[3]-1=3 in the suffix array. We have Pos[3] = 0. Therefore the suffix preceding baa$ in the suffix array is ababaa$, starting at position 0 in S. We set k to be this starting position, so k=0.

    We determine the longest common prefix of the suffices starting at i=3 and k=0. From the previous iteration, we have h=0, so we currently have no information about how much both suffices have in common, so we have to start comparing from the first character onward:

    baa$
    ababaa$
    

    We immediately get a mismatch, so no change to h=0 is done. We set LCP[Rank[3]] = LCP[4] = 0. Since h=0, we do not decrement h further.

  • i=4: We are looking at the suffix starting at position 4 in S, which is aa$. We have Rank[4] = 1, which is > 0, so we know that the suffix has a preceding suffix in the suffix array.

    We want to determine the starting position of this preceding suffix. The preceding suffix is at position Rank[4]-1=0 in the suffix array. We have Pos[0] = 5. Therefore the suffix preceding aa$ in the suffix array is a$, starting at position 5 in S. We set k to be this starting position, so k=5.

    We determine the longest common prefix of the suffices starting at i=4 and k=5. From the previous iteration, we have h=0, so we currently have no information about how much both suffices have in common, so we have to start comparing from the first character onward:

    aa$
    a$
    

    We see that the length of the longest common prefix is one, so we set h=0+1. We set LCP[Rank[4]] = LCP[1] = 1. We decrement h by one, so h=1-1=0.

  • i=5: We are looking at the suffix starting at position 5 in S, which is a$. We have Rank[5] = 0, which is not > 0, so we know that there is no preceding suffix in the suffix array. Therefore, we skip computing the LCP.

In conclusion, we obtained the following LCP entries:

| Suffix  | LCP with prev |
---------------------------
| a$      | n/a           |
| aa$     | 1             |
| abaa$   | 1             |
| ababaa$ | 3             |
| baa$    | 0             |
| babaa$  | 2             |
-----------------------

We can verify that each entry in the LCP gives us the length of the longest common prefix with the previous suffix.

Why does it work?

It should be clear that if h=0, the algorithm is correct, because it simply compares the current suffix with its predecessor in the suffix array, from the first character onwards. So we only need to verify that when h>0, the first h characters of the suffices that are skipped in the comparison already match.

Let us look at the first two iterations in the example. The first iteration was looking at the suffix ababaa$ occurring first in S and its predecessor abaa$ in the suffix array. In the second iteration, we were looking at the suffix babaa$ and its predecessor baa$ in the suffix array.

i=0
  ababaa$
  abaa$
i=1
  babaa$
  baa$

For i=0 we found the first three characters to match (h=3). We then decremented h by one, leading to h=2. In the second iteration we found that in the suffices the first h=2 characters match. Why is that? The suffix babaa$ was obtained by removing the first character of ababaa$, as we moved one position to the right. We see that baa$, the predecessor of babaa$ in the suffix array, happens to be obtained by removing the first character of abaa$, the predecessor of ababaa$ in the suffix array. So if the LCP of ababaa$ and its predecessor in the suffix array has length 3, it makes sense that if we consider the suffices obtained by removing the first character from each, the LCP of these suffices has length 3-1=2.

So for this example we saw that the LCP for i=1 was the LCP of i=0 minus one. So the next question to ask would be, is this true in general? The answer to this is no. In general it could happen that if we move from iteration i to iteration i+1, the i+1th suffix has a predecessor in the suffix array that is unequal to the predecessor in the suffix array of the ith suffix with the first character removed.

Consider i=2 and i=3 in the example. In i=2, we were looking at the suffix abaa$, which has the predecessor aa$ in the suffix array. In i=3, we were looking at the suffix baa$, which as the predecessor ababaa$ in the suffix array.

i=2
  abaa$
  aa$
i=3
  baa$
  ababaa$

The predecessor of i=3 (ababaa$) is unequal to the predecessor from i=2 with the first character removed (a$). However, we can observe that in the suffix array, the predecessor of i=3 (ababaa$) is "sandwiched" between the suffix from i=3 (baa$) and the predecessor from i=2 with the first character removed (a$). I.e. in the suffix array, ababaa$ occurs in between a$ and baa$. This is because in i=2 the suffix abaa$ was directly preceded by aa$ in the suffix array (i.e. aa$ is lexicographically smaller than abaa$) and their LCP is of length >0. Therefore, it follows that when removing the first character from each, the resulting two suffices will still have the same lexicographic order, i.e. a$ is lexicographically smaller than baa$, and therefore a$ comes before baa$ in the suffix array. Thus the direct predecessor of baa$ in the suffix array (ababaa$) has to occur in between a$ and baa$. This kind of "sandwiching" is the key insight for Kasai's algorithm.

Let us look at a different example to illustrate this. Let S = aaababab. The suffix array would be:

| Suffix    | Starts at |
-------------------------
| aaababab$ | 0         |
| aababab$  | 1         |
| ab$       | 6         |
| abab$     | 4         |
| ababab$   | 2         |
| b$        | 7         |
| bab$      | 5         |
| babab$    | 3         |
-------------------------

In iteration i=0 nothing happens because the suffix aaababab$ has no predecessor in the suffix array . In iteration i=1 we would be looking at suffix aababab$, which has the predecessor aaababab$ in the suffix array. In iteration i=2 we would be looking at suffix ababab$ (removing the first character from the suffix in i=1). Its predecessor in the suffix array is abab$.

i=1
  aababab$
  aaababab$
i=2
  ababab$
  abab$

We can see that the predecessor in i=2 (abab$) is different from the predecessor in i=1 with the first character removed (aababab$). However, in the suffix array, aababab$ is "sandwiching" the suffix from i=2 (ababab$) and its actual predecessor (abab$). In i=1 we determine that the LCP of aababab$ and its predecessor aaababab$ is h=2. At the end of the iteration we decrement h by one, so h=2-1=1. In i=2, we then see that we can indeed skip comparison of the first character. This is because we know that the suffix from i=2 (ababab$) and the predecessor from i=1 with the first character removed (aababab$) will match at least in the first 2-1=1 characters, as both were obtained from the suffices in i=1 (which matched in the first two characters), with the first character removed. Further, because the suffix array is lexicographically sorted, we know that all suffices sandwiched between these two will also have the same first 2-1=1 characters. Therefore, when we compute the LCP in i=2, it is valid to skip the first 2-1=1 characters.

This is true in general: When we go from iteration i to iteration i+1, for the suffix in iteration i+1 the direct predecessor in the suffix array is sandwiched between the suffix from iteration i+1 and the predecessor from iteration i with the first character removed. The LCP between the suffix from iteration i+1 and the predecessor from iteration i with the first character removed is equal to the LCP from iteration i minus one. Because the predecessor of the suffix in i+1 is sandwiched between the suffix from iteration i+1 and the predecessor from iteration i with the first character removed, it follows that the LCP from i+1 is at least the LCP from i, minus one. Therefore, in the algorithm it is valid to skip the first h characters when computing the LCP. This is Theorem 1 in the paper linked above.

like image 120
f9c69e9781fa194211448473495534 Avatar answered Sep 17 '22 05:09

f9c69e9781fa194211448473495534