Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most effective way to lookup a substring C of string B in string A in LINQ

Having 2 strings like:

string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";

I would like to just delete the part that is common in both strings and then the rest concatenate it. I have to tell that what I need to delete only left matched part so I would get

input

 ATTAGACCTGCCGGAA
          GCCGGAATAC

output

ATTAGACCTGCCGGAATAC

Firstly I thought to use a pattern and then seacrh for it, however this is not possible as I do not know the pattern in advance (the length of matched chars is variable)

Then I thought on search whole string b in a then if had no succes, delete a char in string a (Last one since I want to preserve most left unmatched string) and then loop until I have no more chars in b like

string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";
int times = b.Length;
string wantedString = string.Empty;
string auxString = b;
while (times > 0)
{

    if (!a.Contains(auxString))
    {
        //save last char and then delete it from auxString
        wantedString += auxString[auxString.Length - 1];
        auxString = auxString.TrimEnd(auxString[auxString.Length - 1]);
    }
    else
        break;
    times--;
}
//reverse string 
char[] reversedToAppend = wantedString.ToCharArray();
Array.Reverse(reversedToAppend);
string toAppend = new string(reversedToAppend);

so the answer would be just to do a + toAppend ;
Is there a way to make this more efficient? (maybe in LINQ?)

Edit

As @lavin points out correctly c can occur anywhere in a, while being a prefix of b. for instance if a=AAT and b=AAG, code should return AATG. the reason is because common string starting on left is c=AA. We delete this from b and then we get a=AAT with the resulting G

AAT
AAG

resulting

AATG

Other example would be:

a=ATTTGGGCCGCGCGCGAAAACCCCGCG
b=                  AACCCCGCGCGCA

here

c= AACCCCGCG

so result should be

result = ATTTGGGCCGCGCGCGAAAACCCCGCGCGCA
like image 722
edgarmtze Avatar asked Jul 16 '15 21:07

edgarmtze


1 Answers

(all arrays and strings are 0 based in this answer)

First I want to point out that OP's problem is confusing. Assume c is the common part of a and b, OP's example of input and output suggest that c needs to be the suffix of a, and the prefix of b at the same time. I see some of the answers above adopted this understanding of the problem.

However, the original implementation provided by OP suggests that, c can occur anywhere in a, while being a prefix of b, because your using of a.Contains(auxString). That means for a=AAT and b=AAG, your code will return AATG. However other people's answers will return AATAAG.

So there are two possible interpretation of your problem. Please clarify.

Second, assuming the size of the first string a is N, and the second string b is M, unlike the O(N*M) solution provided in the original solution and existing answers, an O(N+M) algorithm can be achieved by using any of the following: KMP, Suffix Array, Suffix Tree, Z-algorithm.

I'll briefly describe how to use Z-algorithm to solve this problem here, since it seems to be much less mentioned on stackoverflow compared to others.

About details of Z-algorithm, see http://www.cs.umd.edu/class/fall2011/cmsc858s/Lec02-zalg.pdf

Basically for a string S of length L, it calculates an array Z of length L, in which Z[i] equals to the longest common prefix of S and S[i:] (S[i:] means substring of S starting from position i).

For this problem, we combine strings a and b into d=b+a (b in front of a), and calculates the Z array of the combined string d. Using this Z array, we can easily figure out the longest prefix of b that also occurs in a.

For possible interpretation one of the problem, in which c needs to be the suffix of a and prefix of b:

max_prefix = 0
for i in range(M, N+M):
  if Z[i] == N+M - i: 
    if Z[i] > max_prefix:
      max_prefix = Z[i]

and the answer would be:

a+b[max_prefix:]

For possible interpretation two of the problem, in which c needs to be the prefix of b, and can be anywhere in a:

max_prefix = 0
for i in range(M, N+M):
  if Z[i] > max_prefix:
    max_prefix = Z[i]

again the answer would be:

a+b[max_prefix:]

The difference in those two cases are this line:

  if Z[i] == N+M-i: 

To understand this line, remember that Z[i] is the longest common prefix of strings d and d[i:], then:

  1. Note that d=b+a
  2. We enumerate i from M to M+N-1, that's the range of a in d. So d[i:] is equal to a[i-M:]. And the length of a[i-M:] is N-(i-M)=N+M-i.
  3. Since d starts with b, checking if Z[i] is equal to N+M-i, is checking if a[i-M:] is also a prefix of b. If they are indeed equal, then we found a common string c, which is the prefix of b, and also a suffix of a.
  4. Without this line, we only know that we found a string c which is a prefix of b, and occurs in a starting from position i, and is not guaranteed to reach the end of a.
like image 73
lavin Avatar answered Nov 15 '22 00:11

lavin