I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD
, CDE
, ABC
, A
, and the resulting tiled string should be ABCDE
, because BCD
aligns with CDE
yielding BCDE
, which is then aligned with ABC
yielding the final ABCDE
.
Currently, I'm using a slightly naïve algorithm, that works as follows. Starting with a random pair of strings, say BCD
and CDE
, I use the following (in Java):
public static String tile(String first, String second) {
for (int i = 0; i < first.length() || i < second.length(); i++) {
// "right" tile (e.g., "BCD" and "CDE")
String firstTile = first.substring(i);
// "left" tile (e.g., "CDE" and "BCD")
String secondTile = second.substring(i);
if (second.contains(firstTile)) {
return first.substring(0, i) + second;
} else if (first.contains(secondTile)) {
return second.substring(0, i) + first;
}
}
return EMPTY;
}
System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ
Although this works, it's not very efficient, as it iterates over the same characters over and over again.
So, does anybody know a better (more efficient) algorithm to do this ? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) are very much welcome. Also note that I'm not looking for an alignment, but a tiling, because I require a full overlap of one of the strings over the other.
I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.
Thanks in advance.
For situations where there is ambiguity -- e.g., {ABC, CBA}
which could result in ABCBA
or CBABC
--, any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}
, which are manipulated so that the aforementioned algorithm works.
Similar question: Efficient Algorithm for String Concatenation with Overlap
A new method, Greedy String Tiling (GST), is proposed for comparing pairs of strings (and hence files) to determine the degree to which they are similar. Greedy String Tiling is based on one-to-one matching, and is able to deal with transposition of substrings.
Greedy String Tiling (GST) computes the similarity between two sequences and have been used in software code, free text or biological subsequences [23] . In contrast to other methods, GST can deal with the transposition of tokens [29]. ...
Order the strings by the first character, then length (smallest to largest), and then apply the adaptation to KMP found in this question about concatenating overlapping strings.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With