
How to find Longest Common Substring using C++

Tags: c++, algorithm, lcs

I searched online for a C++ Longest Common Substring implementation but failed to find a decent one. I need an LCS algorithm that returns the substring itself, not just its length.

I was wondering, though, how I can do this across multiple strings.

My idea was to find the longest common substring of 2 strings and then check it against all the others, but this is a very slow process that requires keeping many long strings in memory, which makes my program quite slow.

Any idea how this can be sped up for multiple strings? Thank you.

Important edit: one of the variables I'm given, K, determines the number of strings the longest common substring needs to appear in. So I can be given 10 strings and have to find the LCS of them all (K=10), or the LCS of 4 of them, but I'm not told which 4; I have to find the best 4.

David Gomes asked Apr 20 '12 15:04





3 Answers

Here is an excellent article on finding all common substrings efficiently, with examples in C. This may be overkill if you need just the longest, but it may be easier to understand than the general articles about suffix trees.

Adrian McCarthy answered Sep 22 '22 18:09



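The answer is a generalised suffix tree: http://en.wikipedia.org/wiki/Generalised_suffix_tree

You can build a generalised suffix tree from multiple strings.

See also http://en.wikipedia.org/wiki/Longest_common_substring_problem

A suffix tree can be built in O(n) time for each string, so O(k*n) in total, where k is the number of strings and n is the length of each string. The longest common substring is then the deepest node whose subtree contains suffixes from enough of the strings (all of them, or at least K of them in your case).

So this problem can be solved very quickly.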
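A full generalised suffix tree is too long to show here, so the sketch below swaps in a simpler technique: binary search over the answer's length, plus a hash map counting how many strings contain each substring of that length. It is not a suffix tree and it is slower, but it shows the "appears in at least K strings" counting. The function names commonOfLength and longestCommonSubstringOfK are hypothetical, chosen only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Returns some substring of length len that occurs in at least k of the strings,
// or an empty string if there is none (or if len == 0).
static std::string commonOfLength(const std::vector<std::string>& strs,
                                  std::size_t len, std::size_t k) {
    if (len == 0) return "";
    // substring -> number of distinct input strings that contain it
    std::unordered_map<std::string, std::size_t> count;
    for (const std::string& s : strs) {
        if (s.size() < len) continue;
        std::unordered_set<std::string> seen; // count each substring once per string
        for (std::size_t i = 0; i + len <= s.size(); ++i) {
            std::string sub = s.substr(i, len);
            if (seen.insert(sub).second && ++count[sub] >= k) return sub;
        }
    }
    return "";
}

// Longest substring shared by at least k of the given strings ("" if none).
std::string longestCommonSubstringOfK(const std::vector<std::string>& strs,
                                      std::size_t k) {
    if (k == 0 || k > strs.size()) return "";
    std::size_t lo = 0, hi = 0;
    for (const std::string& s : strs) hi = std::max(hi, s.size());
    std::string best;
    // If a substring of length L is in k strings, so is its prefix of length L-1,
    // so feasibility is monotone and binary search over the length is valid.
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo + 1) / 2; // upper middle, always >= 1
        std::string found = commonOfLength(strs, mid, k);
        if (!found.empty()) {
            best = found;
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return best;
}
```

For example, with the strings "xabcdy", "zabcdw", "qqabcq" and "bcd", K=4 gives "bc" and K=2 gives "abcd". Each length check copies every substring of that length, so this is only reasonable for moderately sized inputs; a generalised suffix tree or suffix automaton is the way to go for large ones.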

Lxcypp answered Sep 21 '22 18:09



This is a dynamic programming problem and can be solved in O(mn) time, where m is the length of one string and n is the length of the other.

As with any other dynamic programming problem, we divide the problem into subproblems. Say the two strings are x1x2x3...xm and y1y2y3...yn.

Let S(i,j) be the length of the longest common substring of x1x2x3...xi and y1y2y3...yj, and let E(i,j) be the length of the longest common substring that ends exactly at xi and yj (E(i,j) = E(i-1,j-1) + 1 if x[i] == y[j], otherwise 0). Then

S(i,j) = max { E(i,j), S(i-1, j-1), S(i, j-1), S(i-1, j) }

Here is a working program in Java. I am sure you can convert it to C++:

public class LongestCommonSubstring {

    public static void main(String[] args) {
        String str1 = "abcdefgijkl";
        String str2 = "mnopabgijkw";
        System.out.println(getLongestCommonSubstring(str1, str2)); // prints "gijk"
    }

    public static String getLongestCommonSubstring(String str1, String str2) {
        // longest[i][j] is the length of the longest common substring that ends
        // exactly at str2.charAt(i - 1) and str1.charAt(j - 1).
        // Row 0 and column 0 stand for an empty prefix; Java zero-initializes them.
        int[][] longest = new int[str2.length() + 1][str1.length() + 1];
        int max_length = 0; // length of the best substring found so far
        int max_index = 0;  // exclusive end index of that substring in str1

        for (int i = 1; i <= str2.length(); i++) {
            for (int j = 1; j <= str1.length(); j++) {
                if (str2.charAt(i - 1) == str1.charAt(j - 1)) {
                    // Extend the match that ended one character earlier in both strings
                    longest[i][j] = longest[i - 1][j - 1] + 1;
                    // Strict comparison keeps the first longest match on ties
                    if (longest[i][j] > max_length) {
                        max_length = longest[i][j];
                        max_index = j;
                    }
                }
                // On a mismatch longest[i][j] stays 0: no common substring ends here
            }
        }

        // Returns "" when the two strings share no character
        return str1.substring(max_index - max_length, max_index);
    }
}
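Since the question asks for C++, here is a minimal sketch of the same dynamic programming idea, tracking the length of the common substring that ends at each pair of positions. The name longestCommonSubstring is an illustrative choice, not something from the Java program above. Keeping only two rows of the table is my own assumption about what matters here (memory), since each row depends only on the previous one.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Longest common substring of a and b.
// curr[j] holds the length of the common substring ending at a[i-1] and b[j-1];
// prev holds the same values for the previous character of a.
// Time is O(|a| * |b|), memory is O(|b|).
std::string longestCommonSubstring(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    std::size_t bestLen = 0; // length of the best match found so far
    std::size_t bestEnd = 0; // exclusive end index of that match in a

    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1]) {
                // Extend the match that ended one character earlier in both strings.
                curr[j] = prev[j - 1] + 1;
                if (curr[j] > bestLen) {
                    bestLen = curr[j];
                    bestEnd = i;
                }
            } else {
                curr[j] = 0; // no common substring ends at this pair
            }
        }
        std::swap(prev, curr); // curr is fully overwritten on the next pass
    }
    // Empty string when a and b share no character.
    return a.substr(bestEnd - bestLen, bestLen);
}
```

Calling longestCommonSubstring("abcdefgijkl", "mnopabgijkw") returns "gijk", the same result as the Java example's input.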
snegi answered Sep 23 '22 18:09
