Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the number of occurrences of a subsequence in a string

For example, let the string be the first 10 digits of pi, 3141592653, and the subsequence be 123. Note that the sequence occurs twice:

3141592653  1    2  3    1  2  3 

This was an interview question that I couldn't answer and I can't think of an efficient algorithm and it's bugging me. I feel like it should be possible to do with a simple regex, but ones like 1.*2.*3 don't return every subsequence. My naive implementation in Python (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

like image 650
Jake Avatar asked Jul 29 '11 18:07

Jake


People also ask

How do you find the number of subsequences in a string?

Given a string, find the count of distinct subsequences of it. Try It! The problem of counting distinct subsequences is easy if all characters of input string are distinct. The count is equal to nC0 + nC1 + nC2 + … nCn = 2n.

How many times S2 occurs in S1?

Since the string S2 occurs twice as a non-overlapping subsequence in S1, the required output is 2.

How do I generate all subsequences of a string in Python?

A String is a subsequence of a given String, that is generated by deleting some character of a given string without changing its order. Recommended: Please try your approach on {IDE} first, before moving on to the solution. Using Pick and Don't Pick concept : Python3.

What is subsequence of a string example?

A subsequence of a string is a new string that is formed from the original string by deleting some (can be none) of the characters without disturbing the relative positions of the remaining characters. (i.e., "ace" is a subsequence of "abcde" while "aec" is not).


2 Answers

This is a classical dynamic programming problem (and not typically solved using regular expressions).

My naive implementation (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

That would be an exhaustive search approach which runs in exponential time. (I'm surprised it runs for hours though).


Here's a suggestion for a dynamic programming solution:

Outline for a recursive solution:

(Apologies for the long description, but each step is really simple so bear with me ;-)

  • If the subsequence is empty a match is found (no digits left to match!) and we return 1

  • If the input sequence is empty we've depleted our digits and can't possibly find a match thus we return 0

  • (Neither the sequence nor the subsequence are empty.)

  • (Assume that "abcdef" denotes the input sequence, and "xyz" denotes the subsequence.)

  • Set result to 0

  • Add to the result the number of matches for bcdef and xyz (i.e., discard the first input digit and recurse)

  • If the first two digits match, i.e., a = x

    • Add to the result the number of matches for bcdef and yz (i.e., match the first subsequence digit and recurse on the remaining subsequence digits)

  • Return result


Example

Here's an illustration of the recursive calls for input 1221 / 12. (Subsequence in bold font, · represents empty string.)

enter image description here


Dynamic programming

If implemented naively, some (sub-)problems are solved multiple times (· / 2 for instance in the illustration above). Dynamic programming avoids such redundant computations by remembering the results from previously solved subproblems (usually in a lookup table).

In this particular case we set up a table with

  • [length of sequence + 1] rows, and
  • [length of subsequence + 1] columns:

          enter image description here

The idea is that we should fill in the number of matches for 221 / 2 in the corresponding row / column. Once done, we should have the final solution in cell 1221 / 12.

We start populating the table with what we know immediately (the "base cases"):

  • When no subsequence digits are left, we have 1 complete match:

          enter image description here

  • When no sequence digits are left, we can't have any matches:

    enter image description here

We then proceed by populating the table top-down / left-to-right according to the following rule:

  • In cell [row][col] write the value found at [row-1][col].

    Intuitively this means "The number of matches for 221 / 2 includes all the matches for 21 / 2."

  • If sequence at row row and subseq at column col start with the same digit, add the value found at [row-1][col-1] to the value just written to [row][col].

    Intuitively this means "The number of matches for 1221 / 12 also includes all the matches for 221 / 12."

          enter image description here

The final result looks as follows:

          enter image description here

and the value at the bottom right cell is indeed 2.


In Code

Not in Python, (my apologies).

class SubseqCounter {      String seq, subseq;     int[][] tbl;      public SubseqCounter(String seq, String subseq) {         this.seq = seq;         this.subseq = subseq;     }      public int countMatches() {         tbl = new int[seq.length() + 1][subseq.length() + 1];          for (int row = 0; row < tbl.length; row++)             for (int col = 0; col < tbl[row].length; col++)                 tbl[row][col] = countMatchesFor(row, col);          return tbl[seq.length()][subseq.length()];     }      private int countMatchesFor(int seqDigitsLeft, int subseqDigitsLeft) {         if (subseqDigitsLeft == 0)             return 1;          if (seqDigitsLeft == 0)             return 0;          char currSeqDigit = seq.charAt(seq.length()-seqDigitsLeft);         char currSubseqDigit = subseq.charAt(subseq.length()-subseqDigitsLeft);          int result = 0;          if (currSeqDigit == currSubseqDigit)             result += tbl[seqDigitsLeft - 1][subseqDigitsLeft - 1];          result += tbl[seqDigitsLeft - 1][subseqDigitsLeft];          return result;     } } 

Complexity

A bonus for this "fill-in-the-table" approach is that it is trivial to figure out complexity. A constant amount of work is done for each cell, and we have length-of-sequence rows and length-of-subsequence columns. Complexity is therefor O(MN) where M and N denote the lengths of the sequences.

like image 173
aioobe Avatar answered Sep 23 '22 06:09

aioobe


Great answer, aioobe! To complement your answer, some possible implementations in Python:

1) straightforward, naïve solution; too slow!

def num_subsequences(seq, sub):     if not sub:         return 1     elif not seq:         return 0     result = num_subsequences(seq[1:], sub)     if seq[0] == sub[0]:         result += num_subsequences(seq[1:], sub[1:])     return result 

2) top-down solution using explicit memoization

def num_subsequences(seq, sub):     m, n, cache = len(seq), len(sub), {}     def count(i, j):         if j == n:             return 1         elif i == m:             return 0         k = (i, j)         if k not in cache:             cache[k] = count(i+1, j) + (count(i+1, j+1) if seq[i] == sub[j] else 0)         return cache[k]     return count(0, 0) 

3) top-down solution using the lru_cache decorator (available from functools in python >= 3.2)

from functools import lru_cache  def num_subsequences(seq, sub):     m, n = len(seq), len(sub)     @lru_cache(maxsize=None)     def count(i, j):         if j == n:             return 1         elif i == m:             return 0         return count(i+1, j) + (count(i+1, j+1) if seq[i] == sub[j] else 0)     return count(0, 0) 

4) bottom-up, dynamic programming solution using a lookup table

def num_subsequences(seq, sub):     m, n = len(seq)+1, len(sub)+1     table = [[0]*n for i in xrange(m)]     def count(iseq, isub):         if not isub:             return 1         elif not iseq:             return 0         return (table[iseq-1][isub] +                (table[iseq-1][isub-1] if seq[m-iseq-1] == sub[n-isub-1] else 0))     for row in xrange(m):         for col in xrange(n):             table[row][col] = count(row, col)     return table[m-1][n-1] 

5) bottom-up, dynamic programming solution using a single array

def num_subsequences(seq, sub):     m, n = len(seq), len(sub)     table = [0] * n     for i in xrange(m):         previous = 1         for j in xrange(n):             current = table[j]             if seq[i] == sub[j]:                 table[j] += previous             previous = current     return table[n-1] if n else 1 
like image 25
Óscar López Avatar answered Sep 24 '22 06:09

Óscar López