We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,
"a" + "b" + "c" => "abc"
"abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
"abcdede" + "dedefgh" + "" => "abcdedefgh"
"abcde" + "d" + "ghlmn" => "abcdedghlmn"
"abcdef" + "" + "defghl" => "abcdefghl"
Our current algorithm is pretty slow because it uses brute-force to identify the overlapping part between 2 strings. Does any one know an efficient algorithm to do this?
Say we have 2 strings A and B. The algorithm needs to find the longest common substring S so that A ends with S and B starts with S.
Our current brute-force implementation in Java is attached for reference,
public static String concat(String s1, String s2) {
if (s1 == null)
return s2;
if (s2 == null)
return s1;
int len = Math.min(s1.length(), s2.length());
// Find the index for the end of overlapping part
int index = -1;
for (int i = len; i > 0; i--) {
String substring = s2.substring(0, i);
if (s1.endsWith(substring)) {
index = i;
break;
}
}
StringBuilder sb = new StringBuilder(s1);
if (index < 0)
sb.append(s2);
else if (index <= s2.length())
sb.append(s2.substring(index));
return sb.toString();
}
There are two ways to concatenate strings in Java: By + (String concatenation) operator. By concat() method.
StringBuilder is a widely used and recommended way to concatenate two strings in Java. It is mutable, unlike string, meaning that we can change the value of the object.
Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!
Consider Knuth Morris Pratt. It keeps track of the maximum amount of substring we have matched so far throughout. That means it knows how much of S1 has been matched at the end of S2, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.
That gives you an O(n) algorithm. Nice!
int OverlappedStringLength(string s1, string s2) {
//Trim s1 so it isn't longer than s2
if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);
int[] T = ComputeBackTrackTable(s2); //O(n)
int m = 0;
int i = 0;
while (m + i < s1.Length) {
if (s2[i] == s1[m + i]) {
i += 1;
//<-- removed the return case here, because |s1| <= |s2|
} else {
m += i - T[i];
if (i > 0) i = T[i];
}
}
return i; //<-- changed the return here to return characters matched
}
int[] ComputeBackTrackTable(string s) {
var T = new int[s.Length];
int cnd = 0;
T[0] = -1;
T[1] = 0;
int pos = 2;
while (pos < s.Length) {
if (s[pos - 1] == s[cnd]) {
T[pos] = cnd + 1;
pos += 1;
cnd += 1;
} else if (cnd > 0) {
cnd = T[cnd];
} else {
T[pos] = 0;
pos += 1;
}
}
return T;
}
OverlappedStringLength("abcdef", "defghl") returns 3
You may use a DFA. For example, a string XYZ
should be read by the regular expression ^((A)?B)?C
. That regular expression will match the longest prefix which matches a suffix of the XYZ
string. With such a regular expression you can either match and get the match result, or generate a DFA, on which you can use the state to indicate the proper position for the "cut".
In Scala, the first implementation -- using regex directly -- might go like this:
def toRegex(s1: String) = "^" + s1.map(_.toString).reduceLeft((a, b) => "("+a+")?"+b) r
def concatWithoutMatch(s1 : String, s2: String) = {
val regex = toRegex(s1)
val prefix = regex findFirstIn s2 getOrElse ""
s1 + s2.drop(prefix length)
}
For example:
scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd
scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef
scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn
Here's a solution in Python. It should be faster just by not needing to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.
def concat(*args):
result = ''
for arg in args:
result = _concat(result, arg)
return result
def _concat(a, b):
la = len(a)
lb = len(b)
for i in range(la):
j = i
k = 0
while j < la and k < lb and a[j] == b[k]:
j += 1
k += 1
if j == la:
n = k
break
else:
n = 0
return a + b[n:]
if __name__ == '__main__':
assert concat('a', 'b', 'c') == 'abc'
assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
assert concat('abcdef', '', 'defghl') == 'abcdefghl'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With