Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for checking if a string was built from a list of substrings

You are given a string and an array of strings. How to quickly check, if this string can be built by concatenating some of the strings in the array?

This is a theoretical question, I don't need it for practical reasons. But I would like to know, if there is some good algorithm for this.

EDIT Reading some answer I have noticed, that this is probably NP-Complete problem. Even finding a subset of strings, that will together have same length, as a given string is a classic subset sum problem.

So I guess there is no easy answer to this.

EDIT

Now it seems, that it is not a NP-Complete problem after all. That's way cooler :-)

EDIT

I have came up with a solution that passes some tests:

def can_build_from_substrings(string, substrings):
    prefixes = [True] + [False] * (len(string) - 1)
    while True:
        old = list(prefixes)
        for s in substrings:
            for index, is_set in enumerate(prefixes):
                if is_set and string[index:].startswith(s):
                    if string[index:] == s:
                        return True
                    prefixes[index + len(s)] = True
        if old == prefixes: # nothing has changed in this iteration
            return False

I believe the time is O(n * m^3), where n is length of substrings and m is length of string. What do you think?

like image 236
gruszczy Avatar asked May 13 '11 19:05

gruszczy


4 Answers

Note: I assume here that you can use each substring more than once. You can generalize the solution to include this restriction by changing how we define subproblems. That will have a negative impact on space as well as expected runtime, but the problem remains polynomial.

This is a dynamic programming problem. (And a great question!)

Let's define composable(S, W) to be true if the string S can be written using the list of substrings W.

S is composable if and only if:

  1. S starts with a substring w in W.
  2. The remainder of S after w is also composable.

Let's write some pseudocode:

COMPOSABLE(S, W):
  return TRUE if S = "" # Base case
  return memo[S] if memo[S]

  memo[S] = false

  for w in W:
    length <- LENGTH(w)
    start  <- S[1..length]
    rest   <- S[length+1..-1]
    if start = w AND COMPOSABLE(rest, W) :
      memo[S] = true # Memoize

  return memo[S]

This algorithm has O(m*n) runtime, assuming the length of the substrings is not linear w/r/t to the string itself, in which case runtime would be O(m*n^2) (where m is the size of the substring list and n is the length of the string in question). It uses O(n) space for memoization.

(N.B. as written the pseudocode uses O(n^2) space, but hashing the memoization keys would alleviate this.)

EDIT

Here is a working Ruby implementation:

def composable(str, words)
  composable_aux(str, words, {})
end

def composable_aux(str, words, memo)
  return true if str == ""                # The base case
  return memo[str] unless memo[str].nil?  # Return the answer if we already know it

  memo[str] = false              # Assume the answer is `false`

  words.each do |word|           # For each word in the list:
    length = word.length
    start  = str[0..length-1]
    rest   = str[length..-1]

    # If the test string starts with this word,
    # and the remaining part of the test string
    # is also composable, the answer is true.
    if start == word and composable_aux(rest, words, memo)
      memo[str] = true           # Mark the answer as true
    end
  end

  memo[str]                      # Return the answer
end
like image 73
Paul Rosania Avatar answered Oct 13 '22 17:10

Paul Rosania


It's definitely not quick but you here's an idea:

  • Iterate over all the strings, checking if the target string "begins" with any of them
  • Take the longest string with which the target string begins, remove it from the list and trim it from the main string
  • Rinse, repeat

Stop when you're left with a 0 length target string.

As I said before, this is definitely not fast but should give you a baseline ("it shouldn't get much worse than this").

EDIT

As pointed out in the comments, this will not work. You will have to store the partial matches and fall back on them when you find there is no way further.

  • When you find that a string is the head of the target, push it onto a list. After building the list, you will naturally try the biggest "head" of the target
  • When you find that the head you tried doesn't fit with what's left, try the next best head

This way, eventually you'll explore the entire space of solutions. For every candidate head you'll try every possible tail.

like image 29
cnicutar Avatar answered Oct 13 '22 16:10

cnicutar


This is how I would do it.

  1. Determine the length of the target string.
  2. Determine the length of each string in the substring array
  3. Determine which combination of substrings would yield a string with the same length as the target string (If any, if not you're done)
  4. Generate all permutations of the substring combinations determined in step 3. Check if any of them match the target string.

Generating all permutations is a processor heavy task, so if you can cut down on your 'n' (input size), you'll gain some considerable efficiency.

like image 38
Localghost Avatar answered Oct 13 '22 15:10

Localghost


Inspired by @cnicutars answer:

  • function Possible(array A, string s)
    • If s is empty, return true.
    • compute the array P of all strings in A that are a prefix of s.
    • If P is empty, return false.
    • for each string p in P:
      • if Possible(A with p removed, s with prefix p removed) return true
    • return false
like image 41
Anders Lindahl Avatar answered Oct 13 '22 17:10

Anders Lindahl