Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding all the common substrings of given two strings

I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:

Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.

For example, given the input strings eatsleepnightxyz and eatsleepabcxyz, the results should be:

  • eatsleep (due to eatsleepnightxyz eatsleepabcxyz)
  • xyz (due to eatsleepnightxyz eatsleepabcxyz)
  • a (due to eatsleepnightxyz eatsleepabcxyz)
  • t (due to eatsleepnightxyz eatsleepabcxyz)

However, the result set should not include e from eatsleepnightxyz eatsleepabcxyz, because both es are already contained in the eatsleep mentioned above. Nor should you include ea, eat, ats, etc., as those are also all covered by eatsleep.

In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.

My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.

 For String S1:      Find all the substrings of S1 of all the lengths      While doing so: Check if it is also a substring of       S2. 

Attempt to figure out the time complexity of my approach.

Let the two given strings be n1-String and n2-String

  1. The number of substrings of S1 is clearly n1(n1+1)/2.
  2. But we have got to find the average length a substring of S1.
  3. Let’s say it is m. We’ll find m separately.
  4. Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).
  5. Now, we are checking for each m-String is a substring of S2, which is an n2-String.
  6. This, as we have seen above, is an O(n2 m) algorithm.
  7. The time required by the overall algorithm then is
  8. Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
  9. By performing certain calculations, I came to conclusion that the time complexity is O(n3 m2)
  10. Now, our job is to find m in terms of n1.

Attempt to find m in terms of n1.

Tn = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where Tn is the sum of lengths of all the substrings.

Average will be the division of this sum by the total number of Substrings produced.

This, simply is a summation and division problem whose solution is as follows O(n)

Therefore...

Running time of my algorithm is O(n^5).

With this in mind I wrote the following code:

 package pack.common.substrings;   import java.util.ArrayList;  import java.util.LinkedHashSet;  import java.util.List;  import java.util.Set;   public class FindCommon2 {     public static final Set<String> commonSubstrings = new      LinkedHashSet<String>();   public static void main(String[] args) {     printCommonSubstrings("neerajisgreat", "neerajisnotgreat");     System.out.println(commonSubstrings); }   public static void printCommonSubstrings(String s1, String s2) {     for (int i = 0; i < s1.length();) {         List<String> list = new ArrayList<String>();         for (int j = i; j < s1.length(); j++) {             String subStr = s1.substring(i, j + 1);             if (isSubstring(subStr, s2)) {                 list.add(subStr);             }         }         if (!list.isEmpty()) {             String s = list.get(list.size() - 1);             commonSubstrings.add(s);             i += s.length();         }     }  }   public static boolean isSubstring(String s1, String s2) {     boolean isSubstring = true;     int strLen = s2.length();     int strToCheckLen = s1.length();     if (strToCheckLen > strLen) {         isSubstring = false;     } else {         for (int i = 0; i <= (strLen - strToCheckLen); i++) {             int index = i;             int startingIndex = i;             for (int j = 0; j < strToCheckLen; j++) {                 if (!(s1.charAt(j) == s2.charAt(index))) {                     break;                 } else {                     index++;                 }             }             if ((index - startingIndex) < strToCheckLen) {                 isSubstring = false;             } else {                 isSubstring = true;                 break;             }         }     }     return isSubstring;  } } 

Explanation for my code:

 printCommonSubstrings: Finds all the substrings of S1 and                          checks if it is also a substring of                          S2.  isSubstring : As the name suggests, it checks if the given string                 is a substring of the other string. 

Issue: Given the inputs

  S1 = “neerajisgreat”;   S2 = “neerajisnotgreat”   S3 = “rajeatneerajisnotgreat” 

In case of S1 and S2, the output should be: neerajis and great but in case of S1 and S3, the output should have been: neerajis, raj, great, eat but still I am getting neerajis and great as output. I need to figure this out.

How should I design my code?

like image 819
theimpatientcoder Avatar asked Jan 15 '16 06:01

theimpatientcoder


People also ask

How do you find the common substring of two strings?

For every character in string 1 we increment vector index of that character eg: v[s1[i]-'a']++, for every character of string 2 we check vector for the common characters if v[s2[i]-'a'] > 0 then set flag = true and v[s2[i]-'a']– such that one character of string 2 is compared with only one character of string 1.

How do you find common characters in two strings?

Approach: Count the frequencies of all the characters from both strings. Now, for every character if the frequency of this character in string s1 is freq1 and in string s2 is freq2 then total valid pairs with this character will be min(freq1, freq2). The sum of this value for all the characters is the required answer.


2 Answers

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

public static Set<String> longestCommonSubstrings(String s, String t) {     int[][] table = new int[s.length()][t.length()];     int longest = 0;     Set<String> result = new HashSet<>();      for (int i = 0; i < s.length(); i++) {         for (int j = 0; j < t.length(); j++) {             if (s.charAt(i) != t.charAt(j)) {                 continue;             }              table[i][j] = (i == 0 || j == 0) ? 1                                              : 1 + table[i - 1][j - 1];             if (table[i][j] > longest) {                 longest = table[i][j];                 result.clear();             }             if (table[i][j] == longest) {                 result.add(s.substring(i - longest + 1, i + 1));             }         }     }     return result; } 

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

  e a t s l e e p a b c x y z e 1 0 0 0 0 1 1 0 0 0 0 0 0 0 a 0 2 0 0 0 0 0 0 1 0 0 0 0 0 t 0 0 3 0 0 0 0 0 0 0 0 0 0 0 s 0 0 0 4 0 0 0 0 0 0 0 0 0 0 l 0 0 0 0 5 0 0 0 0 0 0 0 0 0 e 1 0 0 0 0 6 1 0 0 0 0 0 0 0 e 1 0 0 0 0 1 7 0 0 0 0 0 0 0 p 0 0 0 0 0 0 0 8 0 0 0 0 0 0 n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i 0 0 0 0 0 0 0 0 0 0 0 0 0 0 g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 t 0 0 1 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 1 0 0 y 0 0 0 0 0 0 0 0 0 0 0 0 2 0 z 0 0 0 0 0 0 0 0 0 0 0 0 0 3 
  • The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
  • The xyz result is the 123 diagonal at the bottom-right.
  • The a result is indicated by the 1 near the top (second row, ninth column).
  • The t result is indicated by the 1 near the bottom left.

What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.

I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

like image 87
200_success Avatar answered Sep 22 '22 08:09

200_success


Typically this type of substring matching is done with the assistance of a separate data structure called a Trie (pronounced try). The specific variant that best suits this problem is a suffix tree. Your first step should be to take your inputs and build a suffix tree. Then you'll need to use the suffix tree to determine the longest common substring, which is a good exercise.

like image 38
ktbiz Avatar answered Sep 24 '22 08:09

ktbiz