I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows: <blockquote> Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings. For example, given the input strings <code>eatsleepnightxyz</code> and <code>eatsleepabcxyz</code>, the results should be: <ul> <li> <code>eatsleep</code> (due to <code>eatsleepnightxyz</code> <code>eatsleepabcxyz</code>)</li> <li> <code>xyz</code> (due to <code>eatsleepnightxyz</code> <code>eatsleepabcxyz</code>)</li> <li> <code>a</code> (due to <code>eatsleepnightxyz</code> <code>eatsleepabcxyz</code>)</li> <li> <code>t</code> (due to <code>eatsleepnightxyz</code> <code>eatsleepabcxyz</code>)</li> </ul> However, the result set should not include <code>e</code> from <code>eatsleepnightxyz</code> <code>eatsleepabcxyz</code>, because both <code>e</code>s are already contained in the <code>eatsleep</code> mentioned above. Nor should you include <code>ea</code>, <code>eat</code>, <code>ats</code>, etc., as those are also all covered by <code>eatsleep</code>. In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace. </blockquote> My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding. <pre class="prettyprint"><code> For String S1: Find all the substrings of S1 of all the lengths While doing so: Check if it is also a substring of S2. </code></pre> Attempt to figure out the time complexity of my approach. Let the two given strings be n1-String and n2-String <ol> <li>The number of substrings of S1 is clearly n1(n1+1)/2. </li> <li>But we have got to find the average length a substring of S1. </li> <li>Let’s say it is m. We’ll find m separately.</li> <li>Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).</li> <li>Now, we are checking for each m-String is a substring of S2, which is an n2-String. </li> <li>This, as we have seen above, is an O(n2 m) algorithm.</li> <li>The time required by the overall algorithm then is </li> <li>Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)</li> <li>By performing certain calculations, I came to conclusion that the time complexity is O(n3 m2) </li> <li>Now, our job is to find m in terms of n1.</li> </ol> Attempt to find m in terms of n1. Tn = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n) where Tn is the sum of lengths of all the substrings. Average will be the division of this sum by the total number of Substrings produced. This, simply is a summation and division problem whose solution is as follows O(n) Therefore... Running time of my algorithm is O(n^5). With this in mind I wrote the following code: <pre class="prettyprint"><code> package pack.common.substrings; import java.util.ArrayList; import java.util.LinkedHashSet; import java.util.List; import java.util.Set; public class FindCommon2 { public static final Set<String> commonSubstrings = new LinkedHashSet<String>(); public static void main(String[] args) { printCommonSubstrings("neerajisgreat", "neerajisnotgreat"); System.out.println(commonSubstrings); } public static void printCommonSubstrings(String s1, String s2) { for (int i = 0; i < s1.length();) { List<String> list = new ArrayList<String>(); for (int j = i; j < s1.length(); j++) { String subStr = s1.substring(i, j + 1); if (isSubstring(subStr, s2)) { list.add(subStr); } } if (!list.isEmpty()) { String s = list.get(list.size() - 1); commonSubstrings.add(s); i += s.length(); } } } public static boolean isSubstring(String s1, String s2) { boolean isSubstring = true; int strLen = s2.length(); int strToCheckLen = s1.length(); if (strToCheckLen > strLen) { isSubstring = false; } else { for (int i = 0; i <= (strLen - strToCheckLen); i++) { int index = i; int startingIndex = i; for (int j = 0; j < strToCheckLen; j++) { if (!(s1.charAt(j) == s2.charAt(index))) { break; } else { index++; } } if ((index - startingIndex) < strToCheckLen) { isSubstring = false; } else { isSubstring = true; break; } } } return isSubstring; } } </code></pre> Explanation for my code: <pre class="prettyprint"><code> printCommonSubstrings: Finds all the substrings of S1 and checks if it is also a substring of S2. isSubstring : As the name suggests, it checks if the given string is a substring of the other string. </code></pre> Issue: Given the inputs <pre class="prettyprint"><code> S1 = “neerajisgreat”; S2 = “neerajisnotgreat” S3 = “rajeatneerajisnotgreat” </code></pre> In case of S1 and S2, the output should be: <code>neerajis</code> and <code>great</code> but in case of S1 and S3, the output should have been: <code>neerajis</code>, <code>raj</code>, <code>great</code>, <code>eat</code> but still I am getting <code>neerajis</code> and <code>great</code> as output. I need to figure this out. How should I design my code?

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming. The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring: <pre class="prettyprint"><code>public static Set<String> longestCommonSubstrings(String s, String t) { int[][] table = new int[s.length()][t.length()]; int longest = 0; Set<String> result = new HashSet<>(); for (int i = 0; i < s.length(); i++) { for (int j = 0; j < t.length(); j++) { if (s.charAt(i) != t.charAt(j)) { continue; } table[i][j] = (i == 0 || j == 0) ? 1 : 1 + table[i - 1][j - 1]; if (table[i][j] > longest) { longest = table[i][j]; result.clear(); } if (table[i][j] == longest) { result.add(s.substring(i - longest + 1, i + 1)); } } } return result; } </code></pre> Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs <code>eatsleepnightxyz</code> and <code>eatsleepabcxyz</code>: <pre class="prettyprint"><code> e a t s l e e p a b c x y z e 1 0 0 0 0 1 1 0 0 0 0 0 0 0 a 0 2 0 0 0 0 0 0 1 0 0 0 0 0 t 0 0 3 0 0 0 0 0 0 0 0 0 0 0 s 0 0 0 4 0 0 0 0 0 0 0 0 0 0 l 0 0 0 0 5 0 0 0 0 0 0 0 0 0 e 1 0 0 0 0 6 1 0 0 0 0 0 0 0 e 1 0 0 0 0 1 7 0 0 0 0 0 0 0 p 0 0 0 0 0 0 0 8 0 0 0 0 0 0 n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i 0 0 0 0 0 0 0 0 0 0 0 0 0 0 g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 t 0 0 1 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 1 0 0 y 0 0 0 0 0 0 0 0 0 0 0 0 2 0 z 0 0 0 0 0 0 0 0 0 0 0 0 0 3 </code></pre> <ul> <li>The <code>eatsleep</code> result is obvious: that's the <code>12345678</code> diagonal streak at the top-left.</li> <li>The <code>xyz</code> result is the <code>123</code> diagonal at the bottom-right.</li> <li>The <code>a</code> result is indicated by the <code>1</code> near the top (second row, ninth column).</li> <li>The <code>t</code> result is indicated by the <code>1</code> near the bottom left.</li> </ul> What about the other <code>1</code>s at the left, the top, and next to the <code>6</code> and <code>7</code>? Those don't count because they appear within the rectangle formed by the <code>12345678</code> diagonal — in other words, they are already covered by <code>eatsleep</code>. I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

Finding all the common substrings of given two strings

Tags:

java

string

algorithm

I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:

Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.

For example, given the input strings eatsleepnightxyz and eatsleepabcxyz, the results should be:

eatsleep (due to eatsleepnightxyz eatsleepabcxyz)

xyz (due to eatsleepnightxyz eatsleepabcxyz)

a (due to eatsleepnightxyz eatsleepabcxyz)

t (due to eatsleepnightxyz eatsleepabcxyz)

However, the result set should not include e from eatsleepnightxyz eatsleepabcxyz, because both es are already contained in the eatsleep mentioned above. Nor should you include ea, eat, ats, etc., as those are also all covered by eatsleep.

In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.

My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.

 For String S1:      Find all the substrings of S1 of all the lengths      While doing so: Check if it is also a substring of       S2.

Attempt to figure out the time complexity of my approach.

Let the two given strings be n1-String and n2-String

The number of substrings of S1 is clearly n1(n1+1)/2.
But we have got to find the average length a substring of S1.
Let’s say it is m. We’ll find m separately.
Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).
Now, we are checking for each m-String is a substring of S2, which is an n2-String.
This, as we have seen above, is an O(n² m) algorithm.
The time required by the overall algorithm then is
Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
By performing certain calculations, I came to conclusion that the time complexity is O(n³ m²)
Now, our job is to find m in terms of n1.

Attempt to find m in terms of n1.

T_n = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where T_n is the sum of lengths of all the substrings.

Average will be the division of this sum by the total number of Substrings produced.

This, simply is a summation and division problem whose solution is as follows O(n)

Therefore...

Running time of my algorithm is O(n^5).

With this in mind I wrote the following code:

 package pack.common.substrings;   import java.util.ArrayList;  import java.util.LinkedHashSet;  import java.util.List;  import java.util.Set;   public class FindCommon2 {     public static final Set<String> commonSubstrings = new      LinkedHashSet<String>();   public static void main(String[] args) {     printCommonSubstrings("neerajisgreat", "neerajisnotgreat");     System.out.println(commonSubstrings); }   public static void printCommonSubstrings(String s1, String s2) {     for (int i = 0; i < s1.length();) {         List<String> list = new ArrayList<String>();         for (int j = i; j < s1.length(); j++) {             String subStr = s1.substring(i, j + 1);             if (isSubstring(subStr, s2)) {                 list.add(subStr);             }         }         if (!list.isEmpty()) {             String s = list.get(list.size() - 1);             commonSubstrings.add(s);             i += s.length();         }     }  }   public static boolean isSubstring(String s1, String s2) {     boolean isSubstring = true;     int strLen = s2.length();     int strToCheckLen = s1.length();     if (strToCheckLen > strLen) {         isSubstring = false;     } else {         for (int i = 0; i <= (strLen - strToCheckLen); i++) {             int index = i;             int startingIndex = i;             for (int j = 0; j < strToCheckLen; j++) {                 if (!(s1.charAt(j) == s2.charAt(index))) {                     break;                 } else {                     index++;                 }             }             if ((index - startingIndex) < strToCheckLen) {                 isSubstring = false;             } else {                 isSubstring = true;                 break;             }         }     }     return isSubstring;  } }

Explanation for my code:

 printCommonSubstrings: Finds all the substrings of S1 and                          checks if it is also a substring of                          S2.  isSubstring : As the name suggests, it checks if the given string                 is a substring of the other string.

Issue: Given the inputs

  S1 = “neerajisgreat”;   S2 = “neerajisnotgreat”   S3 = “rajeatneerajisnotgreat”

In case of S1 and S2, the output should be: neerajis and great but in case of S1 and S3, the output should have been: neerajis, raj, great, eat but still I am getting neerajis and great as output. I need to figure this out.

How should I design my code?

819

asked Jan 15 '16 06:01

theimpatientcoder

2 Answers

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

public static Set<String> longestCommonSubstrings(String s, String t) {     int[][] table = new int[s.length()][t.length()];     int longest = 0;     Set<String> result = new HashSet<>();      for (int i = 0; i < s.length(); i++) {         for (int j = 0; j < t.length(); j++) {             if (s.charAt(i) != t.charAt(j)) {                 continue;             }              table[i][j] = (i == 0 || j == 0) ? 1                                              : 1 + table[i - 1][j - 1];             if (table[i][j] > longest) {                 longest = table[i][j];                 result.clear();             }             if (table[i][j] == longest) {                 result.add(s.substring(i - longest + 1, i + 1));             }         }     }     return result; }

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

  e a t s l e e p a b c x y z e 1 0 0 0 0 1 1 0 0 0 0 0 0 0 a 0 2 0 0 0 0 0 0 1 0 0 0 0 0 t 0 0 3 0 0 0 0 0 0 0 0 0 0 0 s 0 0 0 4 0 0 0 0 0 0 0 0 0 0 l 0 0 0 0 5 0 0 0 0 0 0 0 0 0 e 1 0 0 0 0 6 1 0 0 0 0 0 0 0 e 1 0 0 0 0 1 7 0 0 0 0 0 0 0 p 0 0 0 0 0 0 0 8 0 0 0 0 0 0 n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i 0 0 0 0 0 0 0 0 0 0 0 0 0 0 g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 t 0 0 1 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 1 0 0 y 0 0 0 0 0 0 0 0 0 0 0 0 2 0 z 0 0 0 0 0 0 0 0 0 0 0 0 0 3

The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
The xyz result is the 123 diagonal at the bottom-right.
The a result is indicated by the 1 near the top (second row, ninth column).
The t result is indicated by the 1 near the bottom left.

What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.

I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

answered Sep 22 '22 08:09

200_success

Typically this type of substring matching is done with the assistance of a separate data structure called a Trie (pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.

answered Sep 24 '22 08:09

ktbiz

Related questions
                            
                                How do I get an inner class to inherit enclosing class' generic type?
                            
                                Maven: javac: source release 1.6 requires target release 1.6
                            
                                Android MediaPlayer takes long time to prepare and buffer
                            
                                Generate javadoc only with JRE
                            
                                How to write integration tests with spring-cloud-netflix and feign
                            
                                Java socket.io client
                            
                                What is the difference between Typed and UnTyped Actors in Akka? When to use what?
                            
                                Jersey exception only thrown when depencencies assembled into a single jar
                            
                                No matter what, I can't batch MySQL INSERT statements in Hibernate
                            
                                MobileAds.initialize(Context, String) is deprecated
                            
                                How can I transition from Java to Clojure?
                            
                                Unit testing real-time / concurrent software [duplicate]
                            
                                How can i use JSTL variable in scriptlet?
                            
                                Scala parser combinators vs ANTLR/Java generated parser?
                            
                                Pass, return and convert to vectors list of lists over JNI
                            
                                Generic methods returning dynamic object types
                            
                                Shutdown hook doesn't work in Eclipse [duplicate]
                            
                                Stop Eclipse/Java running multiple instances
                            
                                Android 4.4 WebView file chooser not opening?
                            
                                Replace a class within the Java class library with a custom version

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With