Having 2 strings like:
string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";
I would like to just delete the part that is common in both strings and then the rest concatenate it. I have to tell that what I need to delete only left matched part so I would get
ATTAGACCTGCCGGAA
GCCGGAATAC
ATTAGACCTGCCGGAATAC
Firstly I thought to use a pattern and then seacrh for it, however this is not possible as I do not know the pattern in advance (the length of matched chars is variable)
Then I thought on search whole string b
in a
then if had no succes, delete a char in string a
(Last one since I want to preserve most left unmatched string) and then loop until I have no more chars in b
like
string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";
int times = b.Length;
string wantedString = string.Empty;
string auxString = b;
while (times > 0)
{
if (!a.Contains(auxString))
{
//save last char and then delete it from auxString
wantedString += auxString[auxString.Length - 1];
auxString = auxString.TrimEnd(auxString[auxString.Length - 1]);
}
else
break;
times--;
}
//reverse string
char[] reversedToAppend = wantedString.ToCharArray();
Array.Reverse(reversedToAppend);
string toAppend = new string(reversedToAppend);
so the answer would be just to do a + toAppend ;
Is there a way to make this more efficient? (maybe in LINQ?)
As @lavin points out correctly c
can occur anywhere in a
, while being a prefix of b. for instance if a=AAT
and b=AAG
, code should return AATG
.
the reason is because common string starting on left is c=AA
. We delete this from b
and then we get a=AAT
with the resulting G
AAT
AAG
resulting
AATG
Other example would be:
a=ATTTGGGCCGCGCGCGAAAACCCCGCG
b= AACCCCGCGCGCA
here
c= AACCCCGCG
so result should be
result = ATTTGGGCCGCGCGCGAAAACCCCGCGCGCA
(all arrays and strings are 0 based in this answer)
First I want to point out that OP's problem is confusing. Assume c
is the common part of a
and b
, OP's example of input and output suggest that c
needs to be the suffix of a
, and the prefix of b
at the same time. I see some of the answers above adopted this understanding of the problem.
However, the original implementation provided by OP suggests that, c
can occur anywhere in a
, while being a prefix of b
, because your using of a.Contains(auxString)
. That means for a=AAT
and b=AAG
, your code will return AATG
. However other people's answers will return AATAAG
.
So there are two possible interpretation of your problem. Please clarify.
Second, assuming the size of the first string a
is N
, and the second string b
is M
, unlike the O(N*M)
solution provided in the original solution and existing answers, an O(N+M)
algorithm can be achieved by using any of the following: KMP, Suffix Array, Suffix Tree, Z-algorithm.
I'll briefly describe how to use Z-algorithm to solve this problem here, since it seems to be much less mentioned on stackoverflow compared to others.
About details of Z-algorithm, see http://www.cs.umd.edu/class/fall2011/cmsc858s/Lec02-zalg.pdf
Basically for a string S
of length L
, it calculates an array Z
of length L
, in which Z[i]
equals to the longest common prefix of S
and S[i:]
(S[i:]
means substring of S
starting from position i
).
For this problem, we combine strings a
and b
into d=b+a
(b
in front of a
), and calculates the Z
array of the combined string d
. Using this Z
array, we can easily figure out the longest prefix of b
that also occurs in a
.
For possible interpretation one of the problem, in which c
needs to be the suffix of a
and prefix of b
:
max_prefix = 0
for i in range(M, N+M):
if Z[i] == N+M - i:
if Z[i] > max_prefix:
max_prefix = Z[i]
and the answer would be:
a+b[max_prefix:]
For possible interpretation two of the problem, in which c
needs to be the prefix of b
, and can be anywhere in a
:
max_prefix = 0
for i in range(M, N+M):
if Z[i] > max_prefix:
max_prefix = Z[i]
again the answer would be:
a+b[max_prefix:]
The difference in those two cases are this line:
if Z[i] == N+M-i:
To understand this line, remember that Z[i]
is the longest common prefix of strings d
and d[i:]
, then:
d=b+a
i
from M
to M+N-1
, that's the range of a
in d
. So d[i:]
is equal to a[i-M:]
. And the length of a[i-M:]
is N-(i-M)=N+M-i
.d
starts with b
, checking if Z[i]
is equal to N+M-i
, is checking if a[i-M:]
is also a prefix of b
. If they are indeed equal, then we found a common string c
, which is the prefix of b
, and also a suffix of a
.c
which is a prefix of b
, and occurs in a
starting from position i
, and is not guaranteed to reach the end of a
.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With