Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String Tiling Algorithm

I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD, CDE, ABC, A, and the resulting tiled string should be ABCDE, because BCD aligns with CDE yielding BCDE, which is then aligned with ABC yielding the final ABCDE.

Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD and CDE, I use the following (in Java):

public static String tile(String first, String second) {
  for (int i = 0; i < first.length() || i < second.length(); i++) {
    // "right" tile (e.g., "BCD" and "CDE")
    String firstTile = first.substring(i);
    // "left" tile (e.g., "CDE" and "BCD")  
    String secondTile = second.substring(i);
    if (second.contains(firstTile)) {
      return first.substring(0, i) + second;
    } else if (first.contains(secondTile)) {
      return second.substring(0, i) + first;
    }
  }
  return EMPTY;
}

System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ

Although this works, it's not very efficient, as it iterates over the same characters over and over again.

So, does anybody know a better (more efficient) algorithm to do this ? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) are very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.

I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.

Thanks in advance.


For situations where there is ambiguity -- e.g., {ABC, CBA} which could result in ABCBA or CBABC --, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}, which are manipulated so that the aforementioned algorithm works.

Similar question: Efficient Algorithm for String Concatenation with Overlap

like image 290
João Silva Avatar asked Sep 17 '09 20:09

João Silva


People also ask

What is string tiling?

A new method, Greedy String Tiling (GST), is proposed for comparing pairs of strings (and hence files) to determine the degree to which they are similar. Greedy String Tiling is based on one-to-one matching, and is able to deal with transposition of substrings.

What is greedy string tiling algorithm?

Greedy String Tiling (GST) computes the similarity between two sequences and have been used in software code, free text or biological subsequences [23] . In contrast to other methods, GST can deal with the transposition of tokens [29]. ...


1 Answers

Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.

like image 90
Daniel C. Sobral Avatar answered Sep 17 '22 16:09

Daniel C. Sobral