How to find the minimal-length subsequence that contains all elements of a sequence


Tags:

algorithm

Given a sequence such as S = {1,8,2,1,4,1,2,9,1,8,4}, I need to find the minimal-length subsequence that contains all elements of S (no duplicates, order does not matter). How do I find this subsequence efficiently?

Note: There are 5 distinct elements in S: {1,2,4,8,9}. The minimum-length subsequence must contain all these 5 elements.
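To pin down what is being asked, here is a brute-force reference sketch. It assumes, as both answers below do, that "subsequence" means a contiguous run of S; the function name `shortest_cover_bruteforce` is made up for this illustration. It runs in O(n^2) time, so it is only useful for checking faster solutions on small inputs.

```python
def shortest_cover_bruteforce(S):
    """Return the shortest contiguous slice of S that contains every
    distinct value of S (the leftmost one on ties), or [] if S is empty."""
    need = len(set(S))              # number of distinct values to cover
    best = None                     # (first, last) inclusive indexes
    for i in range(len(S)):
        seen = set()
        for j in range(i, len(S)):
            seen.add(S[j])
            if len(seen) == need:   # S[i..j] is the shortest cover starting at i
                if best is None or j - i < best[1] - best[0]:
                    best = (i, j)
                break
    return S[best[0]:best[1] + 1] if best else []


print(shortest_cover_bruteforce([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]))  # [2, 9, 1, 8, 4]
```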


asked Aug 01 '11 09:08

russell

2 Answers

Algorithm:

First, count the distinct elements in the array; this is easily done in linear time. Let there be k distinct elements.

Allocate an array cur indexed by element value (this assumes the elements are non-negative integers below a known bound, e.g. 10^5), where cur[v] is the number of times v occurs in the current subsequence (see below).

Keep a variable cnt holding the number of distinct elements in the subsequence currently under consideration. Take two indexes, begin and end, and move them through the array as follows:

1. Initialize cnt and begin to 0 and end to -1 (so that it becomes 0 after the first increment). Then repeat step 2 for as long as possible.

2.1. If cnt != k: increment end. If end has reached the end of the array, stop. If cur[array[end]] is zero, increment cnt. Then increment cur[array[end]].

2.2. Otherwise: advance begin as far as possible: while cur[array[begin]] > 1, decrement cur[array[begin]] and increment begin (cur[array[begin]] > 1 means that the same element occurs again later in the current subsequence). Then compare the interval [begin, end] with the best answer so far and store it if it is shorter. Finally, decrement cur[array[begin]], increment begin and decrement cnt, so that the search moves on to the next candidate interval.

When no further step is possible, the best interval found is the answer. The complexity is O(n): each of the two iterators passes through the array once.

Implementation in C++:

#include <cstring>  // memset
#include <iostream>

using namespace std;

const int MAXSIZE = 10000; // upper bound on n; every element must lie in [0, MAXSIZE)

int arr[ MAXSIZE ];
int cur[ MAXSIZE ];

int main ()
{
   int n; // the size of array
   // read n and the array

   cin >> n;
   for( int i = 0; i < n; ++i )
      cin >> arr[ i ];

   int k = 0;
   for( int i = 0; i < n; ++i )
   {
      if( cur[ arr[ i ] ] == 0 )
         ++k;
      ++cur[ arr[ i ] ];
   }

   // now k is the number of distinct elements

   memset( cur, 0, sizeof( cur )); // we need this array anew
   int begin = 0, end = -1; // to make it 0 after first increment
   int best = -1; // best answer currently found
   int ansbegin = 0, ansend = -1; // interval of the best answer currently found (empty if n == 0)
   int cnt = 0; // distinct elements in current subsequence

   while(1)
   {
      if( cnt < k )
      {
         ++end;
         if( end == n )
            break;
         if( cur[ arr[ end ]] == 0 )
            ++cnt; // this element wasn't present in the current subsequence
         ++cur[ arr[ end ]];
         continue;
      }
      // if we're here it means that [begin, end] interval contains all distinct elements
      // try to shrink it from behind
      while( cur[ arr[ begin ]] > 1 ) // we have another such element later in the subsequence
      {
         --cur[ arr[ begin ]];
         ++begin;
      }
      // now, compare [begin, end] with the best answer found yet
      if( best == -1 || end - begin < best )
      {
         best = end - begin;
         ansbegin = begin;
         ansend = end;
      }
      // now drop arr[begin] so that cnt < k, and start advancing the end iterator again
      --cur[ arr[ begin]];
      ++begin;
      --cnt;
   }

   // output the [ansbegin, ansend] interval as it's the answer to the problem

   cout << ansbegin << ' ' << ansend << endl;
   for( int i = ansbegin; i <= ansend; ++i )
      cout << arr[ i ] << ' ';
   cout << endl;

   return 0;
}
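For comparison, the same two-index loop can be sketched in Python with a dictionary in place of the fixed-size cur array. This is only an illustration under the assumption that the elements are hashable; it removes the requirement that they be small non-negative integers. The function name `shortest_window` is hypothetical and the variable names mirror the C++ code above.

```python
def shortest_window(arr):
    """Return (begin, end), the inclusive indexes of a shortest contiguous
    run of arr that contains every distinct value, or None if arr is empty."""
    k = len(set(arr))    # number of distinct elements
    cur = {}             # occurrences of each value inside [begin, end]
    cnt = 0              # distinct values inside [begin, end]
    begin = 0
    best = None
    for end, value in enumerate(arr):
        cur[value] = cur.get(value, 0) + 1
        if cur[value] == 1:
            cnt += 1
        if cnt < k:
            continue     # window is still missing some value
        # [begin, end] holds every distinct value: shrink it from the left
        while cur[arr[begin]] > 1:
            cur[arr[begin]] -= 1
            begin += 1
        if best is None or end - begin < best[1] - best[0]:
            best = (begin, end)
        # drop arr[begin] so the window becomes incomplete again
        cur[arr[begin]] -= 1
        begin += 1
        cnt -= 1
    return best


print(shortest_window([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]))  # (6, 10)
```

With hash-based dictionaries this is O(n) on average rather than in the worst case.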


answered Nov 16 '22 03:11

Grigor Gevorgyan

This can be solved by dynamic programming.

At each step k, we'll compute the shortest subsequence that ends at the k-th position of S and that satisfies the requirement of containing all the unique elements of S.

Given the solution to step k (hereinafter "the sequence"), computing the solution to step k+1 is easy: append the (k+1)-th element of S to the sequence and then remove, one by one, all elements at the start of the sequence that are contained in the extended sequence more than once.

The solution to the overall problem is the shortest sequence found in any of the steps.

The initialization of the algorithm consists of two stages:

1. Scan S once, building the alphabet of unique values.
2. Find the shortest valid sequence whose first element is the first element of S; the last position of this sequence will be the initial value of k.

All of the above can be done in O(n log n) worst-case time (let me know if this requires clarification).

Here is a complete implementation of the above algorithm in Python:

import collections

S = [1,8,2,1,4,1,2,9,1,8,4,2,4]

# initialization: stage 1
alphabet = set(S)                         # the unique values ("symbols") in S
count = collections.defaultdict(int)      # how many times each symbol appears in the sequence

# initialization: stage 2
start = 0
for end in range(len(S)):
  count[S[end]] += 1
  if len(count) == len(alphabet):         # seen all the symbols yet?
    break
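end += 1

# drop leading duplicates so the initial window is as short as possible
# (without this, an input such as [1, 1, 2] would report the whole list)
while count[S[start]] > 1:
  count[S[start]] -= 1
  start += 1

best_start = start
best_end = end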

# the induction
while end < len(S):
  count[S[end]] += 1
  while count[S[start]] > 1:
    count[S[start]] -= 1
    start += 1
  end += 1
  if end - start < best_end - best_start: # new shortest sequence?
    best_start = start
    best_end = end

print(S[best_start:best_end])

Notes:

the data structures I use (dictionaries and sets) are based on hash tables; they have good average-case performance, but a single operation can degrade to O(n) in the worst case. If it's the worst case that you care about, replacing them with tree-based structures will give the overall O(n log n) I've promised above;
as pointed out by @biziclop, the first scan of S can be eliminated, making the algorithm suitable for streaming data;
if the elements of S are small non-negative integers, as your comments indicate, then count can be flattened out into an integer array, bringing the overall complexity down to O(n).
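One possible way to drop the first scan, as the second note suggests, is sketched below. This is an assumption about how such a variant could look, not necessarily what @biziclop had in mind. The idea is to keep the shortest window covering every symbol seen so far and to discard the recorded best whenever a new symbol appears, since earlier windows cannot contain it. The function name `shortest_cover_one_pass` is hypothetical. The deque buffers the current window so that the input can be any iterable, including a stream.

```python
from collections import deque


def shortest_cover_one_pass(items):
    """Return (start, end), a half-open index range of a shortest window
    containing every distinct value of items, or None for empty input."""
    count = {}         # occurrences of each symbol inside the window
    window = deque()   # buffered elements of the current window
    start = 0          # index of window[0] in the input
    best = None
    for end, x in enumerate(items, 1):    # end is exclusive
        is_new = x not in count           # first time this symbol is seen
        count[x] = count.get(x, 0) + 1
        window.append(x)
        # drop leading elements that occur again later in the window
        while count[window[0]] > 1:
            count[window[0]] -= 1
            window.popleft()
            start += 1
        # a new symbol invalidates every previously recorded window
        if is_new or best is None or end - start < best[1] - best[0]:
            best = (start, end)
    return best


print(shortest_cover_one_pass([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]))  # (6, 11)
```

The buffer can grow to the length of the current window, so memory use is bounded by the longest stretch between repeats rather than by a constant.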

answered Nov 16 '22 02:11

NPE
