Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the longest repeated substring without suffix arrays or suffix trees

The longest repeated substring problem is the following:

Given a string w, find the longest substring of w that appears in at least two locations.

This problem can be solved in linear time using suffix trees and in linear time using enhanced suffix arrays.

My question is this - are there any linear-time algorithms for this problem that do not involve suffix trees or suffix arrays? I'm curious because suffix trees and suffix arrays are famously difficult to code up and manipulate and it would be nice if there were an algorithm for this problem that didn't require the coding or memory overhead of these other structures.

Thanks!

like image 782
templatetypedef Avatar asked May 03 '14 21:05

templatetypedef


People also ask

How do you find the longest repeated substring in a string?

The maximum value of LCSRe(i, j) provides the length of the longest repeating substring and the substring itself can be found using the length and the ending index of the common suffix.

How do you find the longest repeating substring in C++?

Longest Repeating Substring in C++ Suppose we have a string S, we have to find the length of the longest repeating substring(s). We will return 0 if no repeating substring is present. So if the string is like “abbaba”, then the output will be 2. As the longest repeating substring is “ab” or “ba”.


1 Answers

After digging on this for a while, I did find an alternative to suffix trees and suffix arrays. The implementation itself is simple, like you wanted, but even though the code is (roughly) high on brevity, it is extremely low on intuition.

There is a paper, The Suffix Binary Search Tree and Suffix AVL Tree, by Robert W. Irving and Lorna Love, from University of Glasgow, that presents an alternative solution to suffix trees and suffix arrays. The authors claim that managing and building a suffix binary search tree is substantially simpler when compared to suffix trees, and show that construction can be guaranteed to take no longer than O(n log(n)) in the case of a suffix AVL tree (this is also true for the standard, non-balanced original BST in the average case, but it can degrade to O(n^2) in the worst case).

This is, of course, worse than traditional suffix trees, since the time to find the longest repeated substring in an SBST is bounded by the time of building it. So, strictly speaking, this doesn't answer your question, but I decided to post this for future reference and for interested readers.

Furthermore, the paper states that the resulting tree is a strong candidate alternative to traditional suffix trees and suffix arrays. The paper is written around the problem of finding a pattern in logarithmic time after building the tree. You can download it from CiteSeerX.

The idea is that each node is associated with an integer i, where i is the offset of the first character in the target text of the suffix that that node represents. For a given string [a1,a2,a3,...,an] of length n, there will be n nodes, and node i represents the suffix [ai,ai+1,...,an].

The paper doesn't address the problem of determining the longest repeated substring, but the proposed way to build the binary search tree makes it easy to solve the longest repeated substring problem. I will go through that extension once I talk about the building process. It is crucial to understand how to build the tree in order to understand the extension to solve the longest repeated string problem (and otherwise the code will look like magic).

Hence, I will explain first how the tree is built, and will do so based on the ideas I got from the paper. Formal proofs, along with other details (including how to turn this into an AVL tree) are discussed in the paper. You are free to jump ahead to the section where I talk about how I extended the algorithm, but keep in mind that it may be hard to understand if you do so.

The paper describes an optimized algorithm for searching a pattern in a suffix binary search tree that avoids comparing the pattern with the suffix from a node from scratch on each node we visit. Doing so would, in the worst case, require l*m character comparisons, where m is the size of the pattern, and l is the search path. We don't want to compare characters from scratch on each node to decide on which direction to branch; instead, it is desirable to compare each character in the pattern at most once. This is possible if we store two additional attributes per node. The process of building the tree is tightly bound with this optimization (and with my proposed extension), so it is very important to understand it.

First, some terminology: for a given node i, a node j is said to be a right ancestor of i if i is in the left subtree of j. The left ancestor is defined similarly. The closest right ancestor of i is the node j such that no descendant of j is a right ancestor of i; again, a similar definition applies for the closest left ancestor. From now on, I will refer to the closest right ancestor of i using the the abbreviated form cra(i), and the closest left ancestor of i is abbreviated to cla(i). We also define the longest common prefix between two nodes j and i, and we represent it as lcp(i, j).

For a given node i, we will store two attributes, m(i) and d(i). m(i) represents the value for which lcp(i, j) is highest, where j is the set of ancestors of i. Note that, due to the binary tree's properties, node j is either cla(i) or cra(i). d(i) is an attribute to keep track of where m(i) comes from; if j == cla(i), then d(i) is RIGHT (meaning that i is in the right subtree of the node j for which lcp(i, j) is maximized), otherwise, d(i) is LEFT.

What follows is a set of theorems that together build the basic algorithm to perform a search of a given pattern in an SBST. The theorems describe what to do when the search for a pattern arrives at a node i. Please refer to the paper for the formal proofs, I will try to provide an intuitive proof of why the rules are like this. Together, these theorems form a set of rules that allows the algorithm to search for a pattern by comparing each character in the pattern at most once, which is pretty neat!

When a search arrives at node i, we make use of 2 values, llcp and rlcp. llcp is the value for which lcp(pattern, j) is highest, where j is the set of all right ancestors of i. rlcp is the same, but the maximum is taken over all left ancestors of i. Again, due do the binary tree's properties, llcp is just lcp(pattern, cra(i)), and rlcp is lcp(pattern, cla(i)).

Before diving into the theorems, I think it is a good idea to draw a sample SBST on a piece of paper and visualize the semantical meaning of each theorem on the tree.

Theorem 1 is the simplest and covers the case m(i) > max(llcp, rlcp). If that happens, llcp and rlcp don't change, because we will be able to match as much with i as we did with its ancestors, and the search branches in the same direction as d(i). To see why, consider the case that d(i) == LEFT. If d(i) == LEFT, then it means that m(i) comes from a match with cra(i). If we're visiting i, it is because we already know that the pattern is lower than cra(i), and lcp(cra(i), pattern) < lcp(i, cra(i)), which makes the pattern less than i, so we move left. The same process can be applied for the case d(i) == RIGHT, so, indeed, we just need to follow the direction of d(i).

Theorem 2 deals with the case mi < max(llcp, rlcp). This one is harder to understand. Let's see what happens when max(llcp, rlcp) == llcp and when max(llcp, rlcp) == rlcp.

Case 1: max(llcp, rlcp) == llcp

If max(llcp, rlcp) == llcp, then it means that the pattern has more in common with node i's closest right ancestor than with its closest left ancestor. Furthermore, because llcp > m(i), the pattern has more in common with cra(i) than node i has with cra(i). This, together with the fact that i is lower than cra(i) (by definition of BST), implies that the pattern is greater than i. Thus, we branch right.

What about the updates for llcp and rlcp? Because we're branching right, cra(i) will be the same in the next iteration, so llcp remains unchanged. Updating rlcp is a bit more tricky. When we branch right, the new cla will be node i. What happens next depends on whether m(i) came from cra(i) or cla(i). We can use d(i) to know that: if m(i) comes from cra(i), then d(i) == LEFT, otherwise, d(i) == RIGHT.

Case 1.1: max(llcp, rlcp) == llcp && d(i) == RIGHT

In this case, we know that m(i) comes from cla(i), which means that node i has more in common with cla(i) than with cra(i) (and remember, the pattern has more in common with cra(i)). As we saw before, we'll be branching right, which makes the pattern greater than i. This implies that cla < i < pattern, and, therefore, the longest common prefix between pattern and i is the same as between pattern and cla; or in other words, lcp(i, cla) > lcp(i, pattern) (otherwise, pattern would have to be less than i), so rlcp, which is lcp(i, pattern), remains the same.

Case 1.2: max(llcp, rlcp) == llcp && d(i) == LEFT

Now, we know that m(i) comes from cra(i), but the pattern has more in common with cra(i) than i has with cra(i). This means that node i acts as a "bottleneck", making rlcp smaller - rlcp can only be as big as the common prefix between i and cra(i), which is equal to m(i). So, in this case, rlcp is the same as m(i).

A similar analysis can be performed for the reverse situation, where max(llcp, rlcp) == rlcp (and then consider the sub-cases d(i) == RIGHT and d(i) == LEFT). The actions to perform are the reverse version of the previous cases: we branch left, rlcp remains unchanged, and llcp becomes m(i) if d(i) == RIGHT, otherwise, it remains unchanged.

In short:

Theorem 2 results

                                   d(i) == RIGHT                d(i) == LEFT                      

max(llcp, rlcp) == llcp |            branch right        |   branch right; rlcp = m(i)
max(llcp, rlcp) == rlcp |       branch left; llcp = m(i) |         branch left

Theorem 3 explores two cases when m(i) is equal to the longest common prefix of the pattern with another ancestor. In particular, if m(i) == llcp && llcp > rlcp && d(i) == RIGHT, we know that i matches as much with cla as the pattern with cra. Since d(i) == RIGHT, and m(i) == llcp, it follows that lcp(i, cra) < llcp, and i < cra, implying that the pattern is greater than i - we branch right. A similar argument holds for the opposite case; if m(i) == rlcp && rlcp > llcp && d(i) == LEFT, we branch left. In either case, both llcp and rlcp will stay the same: in the former, llcp would never change because cra is still the same, and rlcp won't change because d(i) == RIGHT && m(i) == llcp, i.e., matching the pattern with cla or with i is the same; the same happens in the latter case.

Theorem 4 is where we have to perform actual character comparisons. This happens every time we cannot infer anything about the relative order between the pattern and the current node, that is, when m(i) == rlcp == llcp, or m(i) == llcp && llcp > rlcp && d(i) == LEFT, or the reverse m(i) == rlcp && rlcp > llcp && d(i) == RIGHT. Intuitively, we know that nothing can be infered about the order, and character comparisons are performed beginning on the first character that is not part of the longest common prefix, and stopping on the first that is different (at which point we can reason about the order). Again, if we branch right, then cra remains the same, so llcp doesn't change, and rlcp will now hold the value of the newly computed longest common prefix between the pattern and node i. A similar process happens for the reverse case: we branch left, rlcp stays untouched, and llcp becomes the previously computed value of the longest common prefix. This theorem is intuitively correct, so I won't go any further.

Theorem 3 & 4 results

                                     d(i) == RIGHT                d(i) == LEFT                      
m(i) == llcp && llcp > rlcp |         branch right       |               *
m(i) == rlcp && rlcp > llcp |             *              |         branch left

* = compare(); if branch == left then rlcp = computed_lcp else llcp = computed_lcp

Here's the pseudo-code that brings all of this together, taken from page 9:

enter image description here

My modifications to the algorithm

It may take a little while to wrap your head around the awesomeness of this algorithm, but once you grasp all the details, it can be seen that the problem of finding the longest repeated substring is equivalent to finding the node i for which m(i) is greatest. After all, the longest repeated substring is the longest common prefix that can ever be found between any two nodes. This can be found without any significant overhead if we keep track of it while building the tree: the maximum m(i) seen so far must be kept, and whenever a new node j is inserted, we compare m(j) with the maximum seen so far, and update it if m(j) is greater. In fact, this approach is nothing more than a fancy way to implement an algorithm that is equivalent to sorting all the suffixes and finding the longest common prefix between any two consecutive suffixes, with the advantage of not performing unnecessary character comparisons. That's a considerably good improvement.

The pseudo-code shown above is almost enough to build the standard SBST. We start by adding a root node, with i == 1, representing the whole text. Suffixes are then added left to right. To insert a new suffix i, we search for a pattern that is suffix i. Doing so will make the algorithm stop at the exact spot of insertion. However, the paper doesn't go into much detail on the insertion process. We must be careful with the return i; in the final else for Theorem 4. We can only return if we're searching. If we're doing an insertion and reach a node that brings us to that case of theorem 4, then it means that all of the characters in the new suffix match with some previously inserted suffix. Because suffixes are being inserted from left to right, we also know that the new suffix has less characters than the other one, which means that the new suffix is lower than the other: the correct move is to branch left. Since we branched left, the closest left ancestor stays the same, so we only need to update llcp. llcp becomes the size of the suffix itself, because as we saw, all of it is matched with the node that is now the closest right ancestor.

Obviously, the new node will have a value of m(i) equal to max(llcp, rlcp), and d(i) will, by definition, be RIGHT if max(llcp, rlcp) == rlcp, and LEFT otherwise.

My implementation in C boils down to the pseudo-code, together with the insertion logic. There are two data structures: struct sbst represents a suffix binary search tree along with the maximum m(i) seen so far; and struct node, which is the descriptor for a tree's node.

Here's the full program listing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LEFT 0
#define RIGHT 1
#define MODE_INSERT 1
#define MODE_FIND 2
#define MAX_TEXT_SIZE 1024
#define max(a,b) ((a)>(b)?(a):(b))

struct node {
  int m;
  int d;
  int i;
  struct node *left;
  struct node *right;
};

struct sbst {
  struct node *root;
  int max_mi; /* The maximum lcp in the tree */
  int max_i; /* The node i with m(i) == max_mi */
};

struct node *allocate_node(int m, int d, int i) {
  struct node *new_node = malloc(sizeof(*new_node));
  new_node->m = m;
  new_node->d = d;
  new_node->i = i;
  new_node->left = new_node->right = NULL;
  return new_node;
}

int lcp(char *str1, char *str2);

/* The core where all the work takes place.
   It is assumed that when this function is called, tree->root always points to a valid, allocated root. That is, it is assumed
   that the tree contains at least one node.
   This function provides both a find and an insert algorithm. To find a pattern, <mode> must be MODE_FIND. To insert,
   <mode> must be MODE_INSERT. In the latter case, the parameter <text_i> corresponds to the index in the original string
   of the suffix being inserted (index starts counting at 1, as described in the paper). Also, in MODE_INSERT, the pattern is
   the suffix being inserted.
   If in MODE_FIND, this function returns the index (starting at 1) in the text where <pattern> can be found, or 0 if no such
   pattern could be found.
*/
int find_insert_aux(struct sbst *tree, char *pattern, size_t pattern_len, char *text, size_t text_len, char mode, int text_i) {
  struct node *current, *prev;
  int llcp, rlcp;
  int next_dir;

  current = tree->root;
  llcp = rlcp = 0;

  while (current != NULL) {
    int max_pattern = max(llcp, rlcp);
    if (current->m > max_pattern) {
      next_dir = current->d;
    } else if (current->m < max_pattern) {
      if (llcp > rlcp) {
    next_dir = RIGHT;
    if (current->d == LEFT) {
      rlcp = current->m;
    }
      } else if (rlcp > llcp) {
    next_dir = LEFT;
    if (current->d == RIGHT) {
      llcp = current->m;
    }
      }
    } else if (current->m == llcp && llcp > rlcp && current->d == RIGHT) {
      next_dir = RIGHT;
    } else if (current->m == rlcp && rlcp > llcp && current->d == LEFT) {
      next_dir = LEFT;
    } else {
      int sub_lcp = lcp(pattern+current->m, text+current->m+current->i-1);
      int t = current->m + sub_lcp;
      if (t == pattern_len) {
    if (mode == MODE_FIND) {
      return current->i;
    } else {
      next_dir = LEFT;
      llcp = t;
    }
      } else if (current->i+t-1 == text_len || pattern[t] > text[t+current->i-1]) {
    next_dir = RIGHT;
    rlcp = t;
      } else {
    next_dir = LEFT;
    llcp = t;
      }
    }
    prev = current;
    current = (next_dir == RIGHT ? current->right : current->left);
  }
  if (mode == MODE_INSERT) {
    struct node *new_node = allocate_node(max(llcp, rlcp), (llcp > rlcp ? LEFT : RIGHT), text_i);
    if (next_dir == LEFT)
      prev->left = new_node;
    else
      prev->right = new_node;
    if (new_node->m > tree->max_mi) {
      tree->max_mi = new_node->m;
      tree->max_i = new_node->i;
    }
  }
  return 0;
}

void sbst_insert(struct sbst *tree, char *text, size_t text_size, int i) {
  (void) find_insert_aux(tree, text+i-1, text_size-i+1, text, text_size, MODE_INSERT, i);
}

int sbst_find(struct sbst *tree, char *text, size_t text_size, char *pattern, size_t pattern_size) {
  return find_insert_aux(tree, pattern, pattern_size, text, text_size, MODE_FIND, 0);
}

/* Builds a Suffix Binary Search Tree that keeps track of the highest m(i) as it is built. */
struct sbst *build_sbst(char *text, size_t text_size) {
  if (*text == '\0')
    return NULL;

  struct sbst *tree = malloc(sizeof(*tree));
  tree->root = allocate_node(0, 0, 1);
  tree->max_mi = 0;
  tree->max_i = 1;

 for (int i = 1; text[i] != '\0'; i++)
   sbst_insert(tree, text, text_size, i+1);

  return tree;
}

/* Given an SBST for the input, finds the longest repeated substring in O(1)
   Stores the offset in *offset, and the size of the lrs in *size
*/
void find_lrs(struct sbst *tree, int *offset, int *size) {
  *offset = tree->max_i-1;
  *size = tree->max_mi;
}

/* Debug section */
void dump(struct node *n, char *text, int depth) {
  if (!n)
    return;
  for (int i = 0; i < depth; i++)
    putchar(' '), putchar(' ');
  printf("%d|%d|%d|%s\n", n->m, n->d, n->i, text+n->i-1);
  dump(n->left, text, depth+1);
  dump(n->right, text, depth+1);
}

void dump_sorted(struct node *n, char *text) {
  if (!n)
    return;
  dump_sorted(n->left, text);
  printf("%s\n", text+n->i-1);
  dump_sorted(n->right, text);
}
/* End debug section */

int lcp(char *str1, char *str2) {
  int i;
  for (i = 0; str1[i] != '\0' && str1[i] == str2[i]; i++);
  return i;
}

int main(void) {
  char text[MAX_TEXT_SIZE];
  printf("Enter text, hit RETURN to terminate (max. %d chars): ", MAX_TEXT_SIZE-1);

  fgets(text, sizeof text, stdin);
  size_t text_size = strlen(text);

  /* Trim newline */
  text[--text_size] = '\0';

  struct sbst *tree = build_sbst(text, text_size);

  /* Debug */
  #ifdef DEBUG_MODE
  dump(tree->root, text, 0);
  printf("\n");
  dump_sorted(tree->root, text);
  #endif

  int lrs_offset, lrs_size;
  find_lrs(tree, &lrs_offset, &lrs_size);
  if (lrs_size == 0)
    printf("No longest repeated substring.\n");
  else {
    printf("Longest repeated substring found at offset %d with size %d: %.*s\n", lrs_offset, lrs_size, lrs_size, text+lrs_offset);
  }
  return 0;
}

Had I not read the paper, I would have looked upon this code as black magic.

Note that this is not exactly a good demonstration of the best software engineering practices: someone who is not familiar with the algorithm will have a very hard time reading and understanding it, it leaks memory, it doesn't check the return value of malloc, and has some other flaws too, but I think it's enough to make my point.

While this may not be as optimal as a suffix tree, it is clearly simple to build and provides a good starting point. For example, as a by-product, it can perform pattern matching in logarithmic time - that's pretty good!

NOTE: I haven't had much time to test the implementation. I did some basic testing and it seems to be working, but I can't assure there are no errors.

like image 117
Filipe Gonçalves Avatar answered Dec 09 '22 20:12

Filipe Gonçalves