I modified the title so that it is more understandable.
Here is a detailed version of the question:
We have a string `s` and want to split it into substrings, where every substring is different from each other. What is the maximum number of unique substrings that we can get from one split? In other words, what is the maximum number of unique substrings that concatenate to form `s`?
Here are some examples:
Example 1
s = 'aababaa'
output = 4
Explanation: we can split `s` into `aa|b|aba|a` or `aab|a|b|aa`, and 4 is the maximum number of substrings we can get from one split.
Example 2
s = 'aba'
output = 2
Explanation: `a|ba`
Example 3
s = 'aaaaaaa'
output = 3
Explanation: `a|aa|aaaa`
Note: `s` only contains lowercase characters. I am not told how long `s` can be, so I cannot guess the optimal time complexity. :(
Is it an NP-hard problem? If not, how can I solve it efficiently?
I heard this problem from one of my friends and couldn't answer it. I am trying to use a Trie + greedy approach to solve it, but the method fails for the first example.
Here is the Trie solution that I came up with:
def triesolution(s):
    trie = {}
    p = trie
    output = 0
    for char in s:
        if char not in p:
            # New substring found: count it, record the node,
            # and restart matching from the root.
            output += 1
            p[char] = {}
            p = trie
        else:
            # Current substring was seen before; extend it by one character.
            p = p[char]
    return output
For example 1, the above code returns 3, since it tries to split `s` into `a|ab|abaa`.
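To reproduce the failure:

print(triesolution('aababaa'))  # 3, but the expected output is 4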
Update: Thanks to everyone's ideas, it looks like this problem is very close to an NP-hard problem. Right now, I am trying to think about it from this direction. Suppose we have a function `Guess(n)` that returns `True` if we could find `n` unique substrings from one split, or `False` otherwise. One observation here is that if `Guess(n) == True`, then `Guess(i) == True` for all `i <= n`, since we can always merge the longest substring with an adjacent one (the merged substring is strictly longer than every other part, so it stays unique). This observation leads to a binary search solution. However, it still requires that we can compute the `Guess` function very efficiently. Sadly, I still could not find a polynomial way to compute `Guess(n)`.
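For concreteness, here is a minimal sketch of that framing (my own illustration, assuming a non-empty `s`): `guess` below is plain exponential backtracking, so the binary search does not make the approach polynomial.

def guess(s, n, seen=frozenset()):
    # True iff s can be split completely into pairwise distinct
    # substrings, at least n of them (exponential backtracking).
    if not s:
        return n <= 0
    for i in range(1, len(s) + 1):
        part = s[:i]
        if part not in seen and guess(s[i:], n - 1, seen | {part}):
            return True
    return False

def max_unique_split(s):
    # Binary search over n; valid because guess is monotone in n.
    lo, hi = 1, len(s)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if guess(s, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

print(max_unique_split('aababaa'))  # 4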
This is known as the collision-aware string partition problem and is shown to be NP-complete by a reduction from 3-SAT in a paper by Anne Condon, Ján Maňuch and Chris Thachuk - Complexity of a collision-aware string partition problem and its relation to oligo design for gene synthesis (International Computing and Combinatorics Conference, 265-275, 2008).
(Many thanks to Gilad Barkan (גלעד ברקן) for making me aware of this discussion.)
Let me share my thoughts about this problem from a purely theoretical point of view (note that I also use "factor" instead of "subword").
I think a sufficiently formal definition of the problem (or problems) considered here is the following:
Given a word w, find words u_1, u_2, ..., u_k such that
u_i != u_j for every pair i != j, and
w = u_1 u_2 ... u_k.
Maximisation variant (we want many u_i): maximise k
Minimisation variant (we want short u_i): minimise max{|u_i| : 1 <= i <= k}
These problems become decision problems by additionally giving a bound B, which, depending on whether we are talking about the "many factors" variant or the "short factors" variant, is a lower bound on k (we want at least B factors) or an upper bound on max{|u_i| : 1 <= i <= k} (we want factors of length at most B), respectively. For talking about NP-hardness, we need to talk about decision problems.
Let's use the terms SF for the "short factors" variant and MF for the "many factors" variant. In particular, and this is a really crucial point, the problems are defined in such a way that we get a word over some alphabet that is not restricted in any way. The problem version where we know a priori that we only get input words over, say, the alphabet {a, b, c, d} is a different problem! NP-hardness does not automatically carry over from the "unrestricted" to the "fixed alphabet" variant (the latter might be simpler).
Both SF and MF are NP-complete problems. This has been shown in [1, 1b] and [2], respectively (as Gilad has pointed out already). If I understand the (maybe too informal) problem definition at the beginning of this discussion correctly, then the problem of this discussion is exactly the problem MF. It is initially not mentioned that the words are restricted to come from some fixed alphabet; later it is said that we can assume that only lower-case letters are used. If this means that we only consider words over the fixed alphabet {a, b, c, ..., z}, then this would actually change a lot in terms of NP-hardness.
A closer look reveals some differences in the complexity of SF and MF:
(1) SF remains NP-complete even if the alphabet is fixed (say, binary), as long as the bound B is part of the input [1, 1b].
(2) SF remains NP-complete even if the bound B is fixed, as long as the alphabet is unbounded [1, 1b].
(3) If both the alphabet and the bound B are fixed, SF can be solved in polynomial time [3].
(4) MF is known to be NP-complete for unbounded alphabets [2]; whether it remains NP-complete for fixed alphabets is open.
(5) If the bound B is fixed, MF can be solved in polynomial time.
Some comments on these results: W.r.t. (1) and (2), it is intuitively clear that if the alphabet is binary, then, in order to make the problem SF difficult, the bound B cannot be fixed as well. Conversely, fixing B = 2 means that the alphabet size must get rather large in order to produce difficult instances. As a consequence, (3) is rather trivial (in fact, [3] says slightly more: we can then solve it in running time that is not only polynomial, but |w|^2 times a factor that depends only on the alphabet size and the bound B). (5) is not difficult either: if our word is long in comparison to B, then we can get the desired factorisation by simply splitting it into factors of different lengths. If not, then we can brute-force all possibilities, which is exponential only in B, which in this case is assumed to be a constant.
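To illustrate the first half of that argument for (5), here is a small sketch (my own addition): factors of pairwise different lengths are automatically distinct, so any word w with |w| >= 1 + 2 + ... + B = B(B+1)/2 can be split into at least B distinct factors.

def b_distinct_factors(w, b):
    # Assumes len(w) >= b * (b + 1) // 2.
    # Take factors of lengths 1, 2, ..., b-1 and a final factor of
    # length >= b: all lengths differ, so all factors are distinct.
    parts, pos = [], 0
    for length in range(1, b):
        parts.append(w[pos:pos + length])
        pos += length
    parts.append(w[pos:])
    return parts

print(b_distinct_factors('aaabaabaab', 4))  # ['a', 'aa', 'baa', 'baab']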
So the picture we have is the following: SF seems more difficult, because we have hardness even for fixed alphabets or for a fixed bound B. The problem MF, on the other hand, becomes poly-time solvable if the bound is fixed (in this regard it is easier than SF), while the corresponding question w.r.t. the alphabet size is open. So MF is slightly less complex than SF, even if it turns out that MF for fixed alphabets is also NP-complete. However, if it can be shown that MF can be solved for fixed alphabets in poly-time, then MF would be much easier than SF... because the one case for which it is hard is somewhat artificial (unbounded alphabet!).
I did put some effort into trying to resolve the case of MF with a bounded alphabet, but I was not able to settle it and have stopped working on it since. I do not believe that other researchers have tried very hard to solve it (so this is not one of those very hard open problems that many people have already tried and failed at; I consider it somewhat doable). My guess would be that it is also NP-hard for fixed alphabets, but maybe the reduction is so complicated that you would get something like "MF is hard for alphabets of size 35 or larger", which would not be super nice either.
Regarding further literature, I know the paper [4], which considers the problem of splitting a word w into distinct factors u_1, u_2, ..., u_k that are all palindromes, which is also NP-complete.
I had a quick look at paper [5], pointed out by Gilad. It seems to consider a different setting, though. In this paper, the authors are interested in the combinatorial question of how many distinct subsequences or subwords can be contained in a given word, but these can overlap. For example, aaabaab contains 19 different subwords: a, b, aa, ab, ba, aaa, aab, aba, baa, aaab, aaba, abaa, baab, aaaba, aabaa, abaab, aabaab, aaabaa, aaabaab. Some of them have only one occurrence, like baa, some of them several, like aa. In any case, the question there is not how we can split the word in order to get many distinct factors, since in a split each individual symbol contributes to exactly one factor.
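(A quick way to check such a count, added for convenience:)

w = 'aaabaab'
# All distinct subwords (substrings); occurrences may overlap.
subwords = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
print(len(subwords))  # 19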
Regarding practical solutions to these kinds of problems (keep in mind that I am a theoretician, so take this with a grain of salt):
To my knowledge, there are no theoretical lower bounds (like NP-hardness) that would rule out solving MF in polynomial time if we consider only input words over a fixed alphabet. There is one caveat, though: if you get a poly-time algorithm, then its running time should depend exponentially on the number of symbols in the fixed alphabet (or on some function of it)! Otherwise it would also be a polynomial-time algorithm for the case of unbounded alphabets. So, being a theoretician, I would be looking for algorithmic tasks that can be computed in time exponential only in the number of symbols and that somehow help to devise an algorithm for MF. On the other hand, it is likely that such an algorithm does not exist and that MF is NP-hard in the fixed-alphabet case as well.
If you are interested in practical solutions, it might be helpful to approximate the solution. Getting factorisations that are guaranteed to be at least half as large as the optimum in the worst case would not be too bad.
Heuristics that do not give a provable approximation ratio, but work well in a practical setting would also be interesting, I guess.
Transforming the problem instances into SAT or ILP instances should not be too difficult, and then you could run a SAT or ILP solver to even get optimal solutions.
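For illustration, here is a sketch of one possible ILP encoding (my own addition, assuming the PuLP library; one encoding among several). A split is modelled as a path from position 0 to position |s| in a DAG whose edges are the substrings s[i:j], and distinctness says that each substring value is used at most once.

import pulp

def max_unique_split_ilp(s):
    # Assumes s is non-empty.
    n = len(s)
    prob = pulp.LpProblem("many_factors", pulp.LpMaximize)
    # x[i, j] == 1 iff s[i:j] is one of the chosen factors.
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(i + 1, n + 1)}
    # Objective: maximise the number of chosen factors.
    prob += pulp.lpSum(x.values())
    # Path constraints: one factor starts at 0, one ends at n, and at
    # every inner position as many factors end as start there.
    for k in range(n + 1):
        inflow = pulp.lpSum(x[i, k] for i in range(k))
        outflow = pulp.lpSum(x[k, j] for j in range(k + 1, n + 1))
        if k == 0:
            prob += outflow == 1
        elif k == n:
            prob += inflow == 1
        else:
            prob += inflow == outflow
    # Distinctness: every substring value may be used at most once.
    by_value = {}
    for (i, j) in x:
        by_value.setdefault(s[i:j], []).append(x[i, j])
    for occurrences in by_value.values():
        prob += pulp.lpSum(occurrences) <= 1
    prob.solve()
    return int(pulp.value(prob.objective))

print(max_unique_split_ilp('aababaa'))  # 4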
My personal opinion is that even though it is not known whether the fixed-alphabet case of MF is NP-hard, there are enough theoretical insights suggesting that the problem is hard enough to justify looking for heuristic solutions etc. that work well in a practical setting.
Bibliography:
[1] Anne Condon, Ján Manuch, Chris Thachuk: The complexity of string partitioning. J. Discrete Algorithms 32: 24-43 (2015)
[1b] Anne Condon, Ján Manuch, Chris Thachuk: Complexity of a Collision-Aware String Partition Problem and Its Relation to Oligo Design for Gene Synthesis. COCOON 2008: 265-275
[2] Henning Fernau, Florin Manea, Robert Mercas, Markus L. Schmid: Pattern Matching with Variables: Fast Algorithms and New Hardness Results. STACS 2015: 302-315
[3] Markus L. Schmid: Computing equality-free and repetitive string factorisations. Theor. Comput. Sci. 618: 42-51 (2016)
[4] Hideo Bannai, Travis Gagie, Shunsuke Inenaga, Juha Kärkkäinen, Dominik Kempa, Marcin Piatkowski, Shiho Sugimoto: Diverse Palindromic Factorization is NP-Complete. Int. J. Found. Comput. Sci. 29(2): 143-164 (2018)
[5] Abraham Flaxman, Aram Wettroth Harrow, Gregory B. Sorkin: Strings with Maximally Many Distinct Subsequences and Substrings. Electr. J. Comb. 11(1) (2004)
You can use a recursive function with a set as a second parameter to keep track of the unique substrings in the current path so far. In each call, iterate over every index (plus 1) at which to split off a candidate substring from the front of the string; if the candidate is not yet in the set, make a recursive call with the remaining string and the candidate added to the set, add 1 to its result, and return the maximum over all iterations. Return 0 if the given string is empty or all candidate substrings are already in the set:
def max_unique_substrings(s, seen=()):
    maximum = 0
    for i in range(1, len(s) + 1):
        candidate = s[:i]
        if candidate not in seen:
            maximum = max(maximum, 1 + max_unique_substrings(s[i:], {candidate, *seen}))
    return maximum
Demo: https://repl.it/@blhsing/PriceyScalySphere
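A quick check against the question's examples:

print(max_unique_substrings('aababaa'))  # 4
print(max_unique_substrings('aba'))      # 2
print(max_unique_substrings('aaaaaaa'))  # 3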
In Python 3.8+, the above logic can also be written as a single call to the max function, with a generator expression that filters out already "seen" candidates using an assignment expression:
def max_unique_substrings(s, seen=()):
    return max((1 + max_unique_substrings(s[i:], {candidate, *seen})
                for i in range(1, len(s) + 1)
                if (candidate := s[:i]) not in seen), default=0)
Here's a solution, but it blows up really fast and is nowhere near an efficient solution. It first breaks the string down into a list of unique substrings with no concern for ordering, then attempts to use itertools.permutations to reassemble those substrings back into the original string, testing EACH permutation to see whether it matches the original string.
import itertools as it

def splitter(seq):
    # Return seq itself plus every (prefix, suffix) two-way split of seq.
    temp = [seq]
    for x in range(1, len(seq)):
        temp.append(seq[:x])
        temp.append(seq[x:])
    return temp

if __name__ == "__main__":
    test = input("Enter a string: ")
    temp = splitter(test)
    copy = temp[::]
    # Split each piece once more: every substring is a prefix of a
    # suffix, so two rounds of splitting enumerate all substrings.
    for x in temp:
        if len(x) > 1:
            copy.extend(splitter(x))
    copy = sorted(set(copy))
    print(copy)
    count = []
    for x in range(1, len(test) + 1):
        for perm in it.permutations(copy, x):
            # Keep permutations that reassemble the original string
            # out of pairwise distinct pieces.
            if "".join(perm) == test and len(perm) == len(set(perm)):
                count.append((len(perm), perm))
    print(f"All unique splits: {count}")
    print(f"Longest unique split : {max(count)[0]}")
For the first test we get this:
All unique splits: [(1, ('aababaa',)), (2, ('a', 'ababaa')), (2, ('aa', 'babaa')), (2,
('aab', 'abaa')), (2, ('aaba', 'baa')), (2, ('aabab', 'aa')), (2, ('aababa', 'a')), (3,
('a', 'ab', 'abaa')), (3, ('a', 'aba', 'baa')), (3, ('a', 'abab', 'aa')), (3, ('aa', 'b',
'abaa')), (3, ('aa', 'ba', 'baa')), (3, ('aa', 'baba', 'a')), (3, ('aab', 'a', 'baa')),
(3, ('aab', 'ab', 'aa')), (3, ('aab', 'aba', 'a')), (3, ('aaba', 'b', 'aa')), (3,
('aaba', 'ba', 'a')), (4, ('a', 'aba', 'b', 'aa')), (4, ('aa', 'b', 'a', 'baa')), (4,
('aa', 'b', 'aba', 'a')), (4, ('aab', 'a', 'b', 'aa'))]
Longest unique split : 4
Perhaps this can be optimized somehow, but it takes quite a few seconds on this machine.