I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:
Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.
For example, given the input strings
eatsleepnightxyz
andeatsleepabcxyz
, the results should be:
eatsleep
(due toeatsleepnightxyz
eatsleepabcxyz
)xyz
(due toeatsleepnightxyz
eatsleepabcxyz
)a
(due toeatsleepnightxyz
eatsleepabcxyz
)t
(due toeatsleepnightxyz
eatsleepabcxyz
)However, the result set should not include
e
fromeatsleepnightxyz
eatsleepabcxyz
, because bothe
s are already contained in theeatsleep
mentioned above. Nor should you includeea
,eat
,ats
, etc., as those are also all covered byeatsleep
.In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.
My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.
For String S1: Find all the substrings of S1 of all the lengths While doing so: Check if it is also a substring of S2.
Attempt to figure out the time complexity of my approach.
Let the two given strings be n1-String and n2-String
Attempt to find m in terms of n1.
Tn = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where Tn is the sum of lengths of all the substrings.
Average will be the division of this sum by the total number of Substrings produced.
This, simply is a summation and division problem whose solution is as follows O(n)
Therefore...
Running time of my algorithm is O(n^5).
With this in mind I wrote the following code:
package pack.common.substrings; import java.util.ArrayList; import java.util.LinkedHashSet; import java.util.List; import java.util.Set; public class FindCommon2 { public static final Set<String> commonSubstrings = new LinkedHashSet<String>(); public static void main(String[] args) { printCommonSubstrings("neerajisgreat", "neerajisnotgreat"); System.out.println(commonSubstrings); } public static void printCommonSubstrings(String s1, String s2) { for (int i = 0; i < s1.length();) { List<String> list = new ArrayList<String>(); for (int j = i; j < s1.length(); j++) { String subStr = s1.substring(i, j + 1); if (isSubstring(subStr, s2)) { list.add(subStr); } } if (!list.isEmpty()) { String s = list.get(list.size() - 1); commonSubstrings.add(s); i += s.length(); } } } public static boolean isSubstring(String s1, String s2) { boolean isSubstring = true; int strLen = s2.length(); int strToCheckLen = s1.length(); if (strToCheckLen > strLen) { isSubstring = false; } else { for (int i = 0; i <= (strLen - strToCheckLen); i++) { int index = i; int startingIndex = i; for (int j = 0; j < strToCheckLen; j++) { if (!(s1.charAt(j) == s2.charAt(index))) { break; } else { index++; } } if ((index - startingIndex) < strToCheckLen) { isSubstring = false; } else { isSubstring = true; break; } } } return isSubstring; } }
Explanation for my code:
printCommonSubstrings: Finds all the substrings of S1 and checks if it is also a substring of S2. isSubstring : As the name suggests, it checks if the given string is a substring of the other string.
Issue: Given the inputs
S1 = “neerajisgreat”; S2 = “neerajisnotgreat” S3 = “rajeatneerajisnotgreat”
In case of S1 and S2, the output should be: neerajis
and great
but in case of S1 and S3, the output should have been: neerajis
, raj
, great
, eat
but still I am getting neerajis
and great
as output. I need to figure this out.
How should I design my code?
For every character in string 1 we increment vector index of that character eg: v[s1[i]-'a']++, for every character of string 2 we check vector for the common characters if v[s2[i]-'a'] > 0 then set flag = true and v[s2[i]-'a']– such that one character of string 2 is compared with only one character of string 1.
Approach: Count the frequencies of all the characters from both strings. Now, for every character if the frequency of this character in string s1 is freq1 and in string s2 is freq2 then total valid pairs with this character will be min(freq1, freq2). The sum of this value for all the characters is the required answer.
You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.
The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:
public static Set<String> longestCommonSubstrings(String s, String t) { int[][] table = new int[s.length()][t.length()]; int longest = 0; Set<String> result = new HashSet<>(); for (int i = 0; i < s.length(); i++) { for (int j = 0; j < t.length(); j++) { if (s.charAt(i) != t.charAt(j)) { continue; } table[i][j] = (i == 0 || j == 0) ? 1 : 1 + table[i - 1][j - 1]; if (table[i][j] > longest) { longest = table[i][j]; result.clear(); } if (table[i][j] == longest) { result.add(s.substring(i - longest + 1, i + 1)); } } } return result; }
Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz
and eatsleepabcxyz
:
e a t s l e e p a b c x y z e 1 0 0 0 0 1 1 0 0 0 0 0 0 0 a 0 2 0 0 0 0 0 0 1 0 0 0 0 0 t 0 0 3 0 0 0 0 0 0 0 0 0 0 0 s 0 0 0 4 0 0 0 0 0 0 0 0 0 0 l 0 0 0 0 5 0 0 0 0 0 0 0 0 0 e 1 0 0 0 0 6 1 0 0 0 0 0 0 0 e 1 0 0 0 0 1 7 0 0 0 0 0 0 0 p 0 0 0 0 0 0 0 8 0 0 0 0 0 0 n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i 0 0 0 0 0 0 0 0 0 0 0 0 0 0 g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 t 0 0 1 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 1 0 0 y 0 0 0 0 0 0 0 0 0 0 0 0 2 0 z 0 0 0 0 0 0 0 0 0 0 0 0 0 3
eatsleep
result is obvious: that's the 12345678
diagonal streak at the top-left.xyz
result is the 123
diagonal at the bottom-right.a
result is indicated by the 1
near the top (second row, ninth column).t
result is indicated by the 1
near the bottom left.What about the other 1
s at the left, the top, and next to the 6
and 7
? Those don't count because they appear within the rectangle formed by the 12345678
diagonal — in other words, they are already covered by eatsleep
.
I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.
Typically this type of substring matching is done with the assistance of a separate data structure called a Trie (pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With