Sorry for the long title :)
In this problem, we have a string S of length n, and a string T of length m. We can check whether S is a subsequence of T in O(n+m) time. It's really simple.
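For reference, the plain subsequence check mentioned above can be sketched in C as a single two-pointer pass. This is only a minimal sketch; the function name is_subsequence is made up for illustration and does not come from the question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if s can be obtained from t by deleting zero or more
   characters. Each character of t is looked at once, so the loop runs
   at most m times. (is_subsequence is a hypothetical name.) */
bool is_subsequence(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    while (*s != '\0' && *t != '\0') {
        if (*s == *t) {
            s++;            /* matched one character of s */
        }
        t++;                /* always advance in t */
    }
    return *s == '\0';      /* true when every character of s was matched */
}
```

Note that this version puts no limit on how many characters are deleted in a row, so it accepts "ab" inside "abcccb".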
I am curious about: what if we can delete at most K successive characters? For example, if K = 2, we can make "ab" from "accb", but not from "abcccb". I want to check very quickly whether it's possible.
I could only find the obvious O(nm) solution: check whether it's possible for every pair of suffixes of S and T. I thought a greedy algorithm might work, but for K = 2, the case S = "abc" and T = "ababbc" is a counterexample.
Is there any fast solution to solve this problem?
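To make the greedy counterexample concrete, here is a small C sketch. It assumes that "greedy" means always taking the earliest usable character of T, and that runs of deleted characters at the start and end of T count too (the "abcccb" example suggests this). The names greedy_earliest and exhaustive are made up for illustration. The exhaustive checker is a brute-force reference with exponential worst-case time, so it is only meant for tiny inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Greedy attempt (assumed strategy): match each character of s at the
   earliest position of t that is at most k deletions away. */
bool greedy_earliest(const char *s, const char *t, size_t k)
{
    assert(s != NULL && t != NULL);
    size_t m = strlen(t);
    size_t pos = 0;                       /* first index of t still unused */
    for (; *s != '\0'; s++) {
        /* The next match must lie in t[pos .. pos+k], clipped to the end. */
        size_t limit = (m - pos > k + 1) ? pos + k + 1 : m;
        size_t j = pos;
        while (j < limit && t[j] != *s) j++;
        if (j == limit) return false;     /* nothing usable in the window */
        pos = j + 1;
    }
    return m - pos <= k;                  /* trailing run must be short too */
}

/* Brute-force reference: try every allowed position for each character. */
bool exhaustive(const char *s, const char *t, size_t k)
{
    if (*s == '\0') return strlen(t) <= k;
    for (size_t skipped = 0; skipped <= k && t[skipped] != '\0'; skipped++) {
        if (t[skipped] == *s && exhaustive(s + 1, t + skipped + 1, k))
            return true;
    }
    return false;
}
```

With K = 2, S = "abc" and T = "ababbc", the greedy version matches the first "b" and then cannot reach "c", while the exhaustive version finds the matching that uses the second "b".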
(Update: I've rewritten the opening of this answer to include a discussion of complexity and to discuss some alternative methods and potential risks.)
(Short answer: the only real improvement over the O(nm) approach that I can think of is to observe that we don't usually need to compute all n times m entries in the table. We can calculate only those cells we need. The worst case is unchanged, but in practice it might be very good, depending on the dataset.)
To clarify the problem: We have a string S of length n, and a string T of length m. The maximum allowed gap is k - this gap is to be enforced at the beginning and end of the string also. The gap is the number of unmatched characters between two matched characters - i.e. if the letters are adjacent, that is a gap of 0, not 1.
Imagine a table with n+1 rows and m+1 columns.
      0 1 2 3 4 ... m
    --------------------
  0 | ? ? ? ? ?     ?
  1 | ? ? ? ? ?     ?
  2 | ? ? ? ? ?     ?
  3 | ? ? ? ? ?     ?
... |
  n | ? ? ? ? ?     ?
At first, we could define the entry in row r and column c to be a binary flag that tells us whether the first r characters of S are a valid k-subsequence of the first c characters of T. (Don't worry yet about how to compute these values, or even whether they are useful; we just need to define them clearly first.)
However, this binary-flag table isn't very useful. It's not possible to easily calculate one cell as a function of nearby cells. Instead, we need each cell to store slightly more information. As well as recording whether the relevant strings are a valid subsequence, we need to record the number of consecutive unmatched characters at the end of our substring of T (the substring with c characters). For example, if the first r=2 characters of S are "ab" and the first c=3 characters of T are "abb", then there are two possible matches here: the first characters obviously match each other, but the b can match either of the two later bs. Therefore, we have a choice of leaving one or zero unmatched bs at the end. Which one do we record in the table?
The answer is that, if a cell has multiple valid values, then we take the smallest one. It's logical that we want to make life as easy as possible for ourselves while matching the remainder of the string, and therefore that the smaller the gap at the end, the better. Be wary of other incorrect optimizations - we do not want to match as many characters as possible, or as few as possible. That can backfire. But it is logical, for a given pair of strings S,T, to find the match (if there are any valid matches) that minimizes the gap at the end.
One other observation is that if the string S is much shorter than T, then it cannot match. This obviously depends on k also. The first r characters of S, together with a gap of at most k before, between and after them, can cover at most r + (r+1)k characters of T; if this is less than c, then we can immediately mark (r,c) as -1.
(Any other optimization statements that can be made?)
We do not need to compute all the values in this table. Each cell can be in one of k+3 different states. Cells start off in an 'undefined' state (?). If a matching is not possible for the pair of (sub)strings, the state is -1. If a matching is possible, then the score in the cell will be a number between 0 and k inclusive, recording the smallest possible number of unmatched consecutive characters at the end. This gives us a total of k+3 states.
We are interested only in the entry in the bottom right of the table. If f(r,c) is the function that computes a particular cell, then we are interested only in f(n,m). The value for a particular cell can be computed as a function of the values nearby. We can build a recursive algorithm that takes r and c as input and performs the relevant calculations and lookups in terms of the nearby values. If this function looks up f(r,c) and finds a ?, it will go ahead and compute it and then store the answer.
It is important to store the answer as the algorithm may query the same cell many times. But also, some cells will never be computed. We just start off attempting to calculate one cell (the bottom right) and just lookup-and-calculate-and-store as necessary.
This is the "obvious" O(nm) approach. The only optimization here is the observation that we don't need to calculate all the cells, so in practice the work is often well below nm cell evaluations. Of course, with really nasty datasets, you may end up calculating almost all of the cells, so the worst case remains O(nm)! It's therefore difficult to put a tighter official complexity estimate on this.
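Finally, I should say how to compute a particular cell f(r,c). Here S[r] and T[c] are the r-th and c-th characters, counting from 1. Take the first rule that applies:
- If r==0 and c <= k, then f(r,c) = c. An empty string can match any string with up to k characters in it, leaving all c of them unmatched at the end.
- If r==0 and c > k, then f(r,c) = -1. Too long for a match.
- If r > 0 and c == 0, then f(r,c) = -1. A non-empty string cannot match an empty one.
- If S[r]==T[c] and f(r-1,c-1) != -1, then f(r,c) = 0. This is the best case - a match with no trailing gap.
- If f(r,c-1) != -1 and f(r,c-1) < k, then f(r,c) = f(r,c-1)+1.
- Otherwise, f(r,c) = -1.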
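The rules above can be sketched in C roughly as follows. This is a minimal sketch under some assumptions, not a definitive implementation. The names Table, UNKNOWN, NO_MATCH and k_subsequence are made up for illustration; only f comes from the answer. It treats a non-empty prefix of S against an empty prefix of T as -1, and it compares f(r,c-1) against k in the "extend the trailing gap" rule; both are my reading of the rules.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define UNKNOWN  (-2)   /* the '?' state: cell not computed yet */
#define NO_MATCH (-1)   /* no valid matching exists for this cell */

typedef struct {
    const char *s, *t;
    size_t m;           /* length of t; each row has m + 1 cells */
    int k;              /* maximum allowed gap */
    int *cell;          /* (n + 1) * (m + 1) memoized entries */
} Table;

/* Smallest trailing gap when the first r characters of s are matched
   inside the first c characters of t, or NO_MATCH. */
static int f(Table *tb, size_t r, size_t c)
{
    int *slot = &tb->cell[r * (tb->m + 1) + c];
    if (*slot != UNKNOWN) return *slot;          /* already computed */

    int value = NO_MATCH;
    if (r == 0) {
        value = (c <= (size_t)tb->k) ? (int)c : NO_MATCH;
    } else if (c == 0) {
        value = NO_MATCH;                        /* nothing left to match against */
    } else if (tb->s[r - 1] == tb->t[c - 1] && f(tb, r - 1, c - 1) != NO_MATCH) {
        value = 0;                               /* last characters matched */
    } else {
        int left = f(tb, r, c - 1);              /* leave t[c-1] unmatched */
        if (left != NO_MATCH && left < tb->k) value = left + 1;
    }
    *slot = value;                               /* store before returning */
    return value;
}

/* Returns 1 if s is a valid k-subsequence of t, 0 if not,
   and -1 if the table could not be allocated. */
int k_subsequence(const char *s, const char *t, int k)
{
    assert(s != NULL && t != NULL && k >= 0);
    size_t n = strlen(s), m = strlen(t);
    size_t cells = (n + 1) * (m + 1);
    int *cell = malloc(cells * sizeof *cell);
    if (cell == NULL) return -1;
    for (size_t i = 0; i < cells; i++) cell[i] = UNKNOWN;
    Table tb = { s, t, m, k, cell };
    int result = f(&tb, n, m) != NO_MATCH;
    free(cell);
    return result;
}
```

For example, k_subsequence("ab", "accb", 2) returns 1 and k_subsequence("ab", "abcccb", 2) returns 0. Filling the table with UNKNOWN already touches every cell, so this sketch only saves cell evaluations, not memory or initialization; a hash map keyed on (r,c) would avoid that. The recursion depth can reach n+m, so very long strings would need an iterative version.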
The rest of this answer is my initial, Haskell-based approach. One advantage of it is that it 'understands' that it needn't compute every cell, only computing cells where necessary. But it has the inefficiency that it may calculate one cell many times.
*Also note that the Haskell approach is effectively approaching the problem in mirror image - it tries to build matches from the end substrings (suffixes) of S and T, with a minimal leading bunch of unmatched characters. I don't have the time to rewrite it in its 'mirror image' form!
A recursive approach should work. We want a function that will take three arguments: int K, String S, and String T. However, we don't just want a boolean answer as to whether S is a valid k-subsequence of T.
For this recursive approach, if S is a valid k-subsequence, we also want to know about the best subsequence possible by returning how few characters from the start of T can be dropped. We want to find the 'best' subsequence. If a k-subsequence is not possible for S and T, then we return -1, but if it is possible then we want to return the smallest number of characters we can pull from T while retaining the k-subsequence property.
helloworld
  l    r d
This is a valid 4-subsequence, as the biggest gap has four characters (lowo). This is the best subsequence because it leaves a gap of just two characters at the start (he). Alternatively, here is another valid k-subsequence with the same strings, but it's not as good because it leaves a gap of three at the start:
helloworld
   l   r d
This is written in Haskell, but it should be easy enough to rewrite in any other language. I'll break it down in more detail below.
best :: Int -> String -> String -> Int
--      K      S         T         return
--      where len(S) <= len(T)
best k [] t_string    -- empty S is a subsequence of anything!
    | length(t_string) <= k = length(t_string)
    | length(t_string) >  k = -1
best k sss@(s:ss) [] = (-1)    -- if T is empty, and S is non-empty, then no subsequence is possible
best k sss@(s:ss) tts@(t:ts)   -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0    -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1 + (best k sss ts)    -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1    -- no more options left, return -1 for failure.
A line-by-line analysis:
(A comment in Haskell starts with --)
best :: Int -> String -> String -> Int
A function that takes an Int, and two Strings, and that returns an Int. The return value is to be -1 if a k-subsequence is not possible. Otherwise it will return an integer between 0 and K (inclusive) telling us the smallest possible gap at the start of T.
We simply deal with the cases in order.
best k [] t    -- empty S is a subsequence of anything!
    | length(t) <= k = length(t)
    | length(t) >  k = -1
Above, we handle the case where S is empty ([]). This is simple, as an empty string is always a valid subsequence. But to test if it is a valid k-subsequence, we must calculate the length of T.
best k sss@(s:ss) [] = (-1)
-- if T is empty, and S is non-empty, then no subsequence is possible
That comment explains it. This leaves us with the situations where both strings are non-empty:
best k sss@(s:ss) tts@(t:ts)   -- both are non-empty. Various possibilities:
    | s == t && best k ss ts /= -1 = 0    -- if s==t, and if best k ss ts != -1, then we have the best outcome
    | best k sss ts /= -1
      && best k sss ts < k = 1 + (best k sss ts)    -- this is the only other possibility for a valid k-subsequence
    | otherwise = -1    -- no more options left, return -1 for failure.
tts@(t:ts) matches a non-empty string. The name of the string is tts. But there is also a convenient trick in Haskell to allow you to give names to the first letter in the string (t) and the remainder of the string (ts). Here ts should be read aloud as the plural of t - the s suffix here means 'plural'. We say we have a t and some ts, and together they make the full (non-empty) string.
That last block of code deals with the case where both strings are non-empty. The two strings are called sss and tts. But to save us the hassle of writing head sss and tail sss to access the first letter, and the remainder, of the string, we simply use @(s:ss) to tell the compiler to store those quantities into variables s and ss. If this was C++, for example, you'd get the same effect with char s = sss[0]; as the first line of your function.
The best situation is that the first characters match (s==t) and the remainders of the strings form a valid k-subsequence (best k ss ts /= -1). This allows us to return 0.
The only other possibility for success is if the current complete string (sss) is a valid k-subsequence of the remainder of the other string (ts). We add 1 to this and return it, making an exception if the gap would grow too big.
It's very important not to change the order of those last five lines. They are ordered in decreasing order of how 'good' the score is. We want to test for, and return, the very best possibilities first.
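For readers who don't use Haskell, the best function can be transliterated to C roughly as follows. This is only a sketch: the C signature (plain C strings and an int k) is my assumption, and like the Haskell version it does no memoization, so it may recompute the same pair of suffixes many times.

```c
#include <assert.h>
#include <string.h>

/* Returns the smallest possible number of unmatched characters at the
   start of t, or -1 if s is not a valid k-subsequence of t. */
int best(int k, const char *s, const char *t)
{
    assert(k >= 0 && s != NULL && t != NULL);
    if (*s == '\0') {                  /* empty S: T must be short enough */
        size_t len = strlen(t);
        return len <= (size_t)k ? (int)len : -1;
    }
    if (*t == '\0') return -1;         /* non-empty S, empty T */

    /* Best outcome: first characters match and the remainders work too. */
    if (*s == *t && best(k, s + 1, t + 1) != -1) return 0;

    /* Otherwise leave the first character of T unmatched. */
    int rest = best(k, s, t + 1);
    if (rest != -1 && rest < k) return rest + 1;

    return -1;                         /* no more options left */
}
```

With the earlier example, best(4, "lrd", "helloworld") returns 2, the gap "he" at the start.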
Naive recursive solution. Bonus := return value is the number of ways that the string can be matched.
#include <stdio.h>
#include <string.h>

/* Count the ways needle can be matched inside haystack when every run of
   skipped characters (including at the start and end) is at most skipmax long. */
unsigned skipneedle(const char *haystack, const char *needle, unsigned skipmax)
{
    unsigned found, skipped;

    // fprintf(stderr, "skipneedle(%s,%s,%u)\n", haystack, needle, skipmax);
    /* Needle exhausted: succeed only if the leftover haystack is short enough. */
    if (!*needle) return strlen(haystack) <= skipmax ? 1 : 0;

    found = 0;
    /* Try each allowed position for the first needle character. */
    for (skipped = 0; skipped <= skipmax; haystack++, skipped++) {
        if (!*haystack) break;
        if (*haystack == *needle) {
            found += skipneedle(haystack + 1, needle + 1, skipmax);
        }
    }
    return found;
}

int main(void)
{
    const char *ab = "ab";
    const char *test[] = {"ab", "accb", "abcccb", "abcb", NULL};
    const char **cpp;

    for (cpp = test; *cpp; cpp++) {
        printf("[%s,%s,%u]=%u \n", *cpp, ab, 2u, skipneedle(*cpp, ab, 2));
    }
    return 0;
}