
Algorithm to find the smallest snippet from searching a document?

Tags:

algorithm

I've been going through Skiena's excellent "The Algorithm Design Manual" and got hung up on one of the exercises.

The question is: "Given a search string of three words, find the smallest snippet of the document that contains all three of the search words, i.e., the snippet with smallest number of words in it. You are given the index positions where these words occur in search strings, such as word1: (1, 4, 5), word2: (4, 9, 10), and word3: (5, 6, 15). Each of the lists are in sorted order, as above."

Anything I come up with is O(n^2)... This question is in the "Sorting and Searching" chapter, so I assume there is a simple and clever way to do it. I'm trying something with graphs right now, but that seems like overkill.

Ideas? Thanks

marathon asked Jun 02 '10 02:06


4 Answers

Unless I've overlooked something, here's a simple, O(n) algorithm:

  1. We'll represent the snippet by (x, y) where x and y are where the snippet begins and ends respectively.
  2. A snippet is feasible if it contains all 3 search words.
  3. We will start with the infeasible snippet (0,0).
  4. Repeat the following until y has reached end-of-string and the current snippet is infeasible:
    1. If the current snippet (x, y) is feasible, proceed to the snippet (x+1, y).
      Else (the current snippet is infeasible), proceed to the snippet (x, y+1).
  5. Choose the shortest snippet among all feasible snippets we went through.

Running time - in each iteration either x or y increases by 1. Since x can't exceed y and y can't exceed the string length, the total number of iterations is O(n). Feasibility can be checked in O(1) in this case, because we can track how many occurrences of each word lie within the current snippet, and that count can be updated in O(1) each time x or y increases by 1.

Correctness - For each x, the algorithm visits the minimal feasible snippet (x, ?) that starts at x, if one exists, so it must pass over the overall minimal snippet. Also, if y is the smallest value such that (x, y) is feasible, then any feasible snippet (x+1, y') has y' >= y, so y never needs to move backward. (This bit is why this algorithm is linear and the others aren't.)
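
The steps above can be sketched in C++ roughly as follows. This is only an illustration under some assumptions that are not in the answer: instead of walking every word of the document, the window moves over a pre-merged list of (position, word id) pairs sorted by position, the window is half-open over that list, and the names `smallestSnippet`, `hits`, `count` and `covered` are made up for the example.

```cpp
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// hits: (position, word id) pairs sorted by position; word ids are 0..k-1.
// Returns the smallest snippet length in words, or -1 if no snippet
// contains all k words.
int smallestSnippet(const std::vector<std::pair<int, int>>& hits, int k) {
    std::vector<int> count(k, 0);  // occurrences of each word inside the window
    int covered = 0;               // number of distinct words inside the window
    int best = INT_MAX;
    std::size_t x = 0, y = 0;      // window is hits[x..y), initially empty

    while (true) {
        if (covered == k) {
            // Feasible: record the length, then drop the leftmost hit (x+1).
            int length = hits[y - 1].first - hits[x].first + 1;
            if (length < best) best = length;
            if (--count[hits[x].second] == 0) --covered;
            ++x;
        } else if (y < hits.size()) {
            // Infeasible: take in the next hit (y+1).
            if (count[hits[y].second]++ == 0) ++covered;
            ++y;
        } else {
            break;  // infeasible and nothing left to add
        }
    }
    return best == INT_MAX ? -1 : best;
}
```

Each pass through the loop advances either x or y, so the loop runs at most twice the number of hits, and the `count` array is what keeps the feasibility check at O(1).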

Yuval Cohen answered Nov 19 '22 18:11


I already posted a rather straightforward algorithm that solves exactly that problem in this answer: "Google search results: How to find the minimum window that contains all the search keywords?"

However, in that question we assumed that the input is represented by a text stream and the words are stored in an easily searchable set.

In your case the input is represented slightly differently: as a bunch of vectors with sorted positions for each word. This representation is easily transformable to what is needed for the above algorithm by simply merging all these vectors into a single vector of (position, word) pairs ordered by position. It can be done literally, or it can be done "virtually", by placing the original vectors into the priority queue (ordered in accordance with their first elements). Popping an element from the queue in this case means popping the first element from the first vector in the queue and possibly sinking the first vector into the queue in accordance with its new first element.

Of course, since your statement of the problem explicitly fixes the number of words as three, you can simply check the first elements of all three arrays and pop the smallest one at each iteration. That gives you an O(N) algorithm, where N is the total length of all arrays.

Also, your statement of the problem seems to suggest that target words can overlap in the text, which is rather strange (given that you use the term "word"). Is it intentional? In any case, it doesn't present any problem for the above linked algorithm.
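
The "check the first elements and pop the smallest" merge described above might look like the sketch below. It is an illustration only: the function name `mergeHits`, the `next` cursor array and the tie-breaking rule (lower word id first) are assumptions of the example, and it scans the list heads linearly instead of using a priority queue, which is fine when the number of words is fixed at three.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Merges sorted position lists (one per search word) into a single list of
// (position, word id) pairs ordered by position. Ties go to the lower word id.
std::vector<std::pair<int, int>> mergeHits(
        const std::vector<std::vector<int>>& lists) {
    std::vector<std::size_t> next(lists.size(), 0);  // read cursor per list
    std::vector<std::pair<int, int>> merged;
    while (true) {
        int pick = -1;  // list whose current head is the smallest so far
        for (std::size_t w = 0; w < lists.size(); ++w) {
            if (next[w] == lists[w].size()) continue;  // list exhausted
            if (pick == -1 || lists[w][next[w]] < lists[pick][next[pick]]) {
                pick = static_cast<int>(w);
            }
        }
        if (pick == -1) break;  // every list is exhausted
        merged.emplace_back(lists[pick][next[pick]], pick);
        ++next[pick];
    }
    return merged;
}
```

For the lists in the question this produces (1, word1), (4, word1), (4, word2), (5, word1), (5, word3), and so on, which is the single ordered stream the window algorithm expects.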

AnT answered Nov 19 '22 18:11


From the question, it seems that you're given the index locations for each of your n “search words” (word1, word2, word3, ..., word n) in the document. Using a sorting algorithm, the n independent arrays associated with search words can readily be represented as a single array of all the index locations in ascending numerical order and a word label associated with each index in the array (the index array).

The Basic Algorithm:

(Designed to work whether or not the poster of this question intended to allow two different search words to coexist at the same index number.)

First, we define a simple function for measuring the length of a snippet that contains all n labels given a starting point in the index array. (It is obvious from the definition of our array that any starting point on the array will necessarily be the indexed location of one of the n search labels.) The function simply keeps track of the unique search labels seen as the function iterates through the elements in the array until all n labels have been observed. The length of the snippet is defined as the difference between the index of the last unique label found and the index of the starting point in the index array (the first unique label found). If all n labels aren't observed before the end of the array the function returns a null value.

Now, the snippet length function can be run for each element in your array to associate with it the size of the snippet that starts at that element and contains all n search words. The smallest non-null value returned by the snippet length function over the whole index array corresponds to the snippet in your document that you're looking for.

Necessary Optimizations:

  1. Keep track of the current shortest snippet length so that the value will be known immediately after iterating once through the index array.
  2. When iterating through your array, terminate the snippet length function as soon as the snippet under inspection becomes longer than the shortest snippet previously seen.
  3. When the snippet length function returns null for not locating all n search words in the remaining index array elements, associate a null snippet length to all successive elements in the index array.
  4. If the snippet length function is applied to a word label and the label immediately following it is identical to the starting label, assign a null value to the starting label and move on to the next label.
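
As a rough sketch of the basic algorithm with optimizations 1 to 3 (optimization 4 is left out for brevity), one might write something like the following. The names `shortestSnippet`, `indexArray`, `seen` and `distinct` are invented for the example, -1 stands in for the null value, k is the number of search words (n in the text above), and the length follows the definition given above: last position minus first position, so add 1 if you want a count of words.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// indexArray: (position, label) pairs sorted by position; labels are 0..k-1.
// Returns the smallest (last position - first position) over all snippets
// containing every label, or -1 if no such snippet exists.
int shortestSnippet(const std::vector<std::pair<int, int>>& indexArray, int k) {
    int best = -1;  // optimization 1: running minimum, -1 means "none yet"
    for (std::size_t start = 0; start < indexArray.size(); ++start) {
        std::vector<bool> seen(k, false);  // labels seen from this start
        int distinct = 0;
        bool complete = false;  // all k labels were found
        bool pruned = false;    // gave up because best cannot be beaten
        for (std::size_t i = start; i < indexArray.size(); ++i) {
            int length = indexArray[i].first - indexArray[start].first;
            if (best != -1 && length >= best) {  // optimization 2
                pruned = true;
                break;
            }
            int label = indexArray[i].second;
            if (!seen[label]) {
                seen[label] = true;
                ++distinct;
            }
            if (distinct == k) {
                best = length;  // strictly shorter, or the first one found
                complete = true;
                break;
            }
        }
        // Optimization 3: the array ran out before all labels were seen, so
        // no later start can succeed either.
        if (!complete && !pruned) break;
    }
    return best;
}
```

On the index array built from the question's lists, this returns 1 (positions 4 and 5), which corresponds to a two-word snippet.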

Computational Complexity:

The sorting part of the algorithm can be done in O(n log n).

Here's how I would work out the time complexity of the second part of the algorithm (any critiques and corrections would be greatly appreciated).

In the best-case scenario, the algorithm applies the snippet length function only to the first element in the index array and finds that no snippet containing all the search words exists. This scenario takes just n steps, where n is the size of the index array. Slightly worse is the case where the smallest snippet turns out to span the whole array: the cost is then a little less than 2n (once through the array to find the smallest snippet length, and a second time to demonstrate that no other snippets exist).

The shorter the average computed snippet length, the more times the snippet length function needs to be applied over the index array. We can assume that the worst case is the one where the function is applied to every element in the index array. For that to happen, the average snippet length over the whole index array must be negligible in comparison to the size of the index array as a whole.

Using this case, we can write the computational complexity as O(C n), where C is some constant that is significantly smaller than n. This gives a final computational complexity of:

O(n log n + C n)

Where:

C << n

Edit:

AndreyT correctly points out that instead of sorting the word indices in n log n time, one might just as well merge them (since the sub-arrays are already sorted) in n log m time, where m is the number of search word arrays to be merged. This will speed up the algorithm in cases where m < n.

Jawad answered Nov 19 '22 18:11


O(n log k) solution, where n is the total number of indices and k is the number of words. The idea is to use a heap to identify the smallest index at each iteration, while also keeping track of the maximum index in the heap. I also put the coordinates of each value in the heap, in order to be able to retrieve the next value in constant time.

#include <algorithm>
#include <cassert>
#include <limits>
#include <queue>
#include <vector>

using namespace std;

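int snippet(const vector< vector<int> >& index) {
    // Max-heap of (-index[i][j], (i, j)): negating the value makes the
    // smallest index come out first; (i, j) locates the value in `index`.
    priority_queue< pair< int, pair<size_t, size_t> > > queue;
    // Largest index currently held in the heap.
    int nmax = numeric_limits<int>::min();
    // Seed the heap with the first occurrence of every word.
    for (size_t i = 0; i < index.size(); ++i) {
        if (!index[i].empty()) {
            int cur = index[i][0];
            nmax = max(nmax, cur);
            queue.push(make_pair(-cur, make_pair(i, 0)));
        }
    }
    int result = numeric_limits<int>::max();
    // The heap holds one entry per word until some list runs out.
    // The emptiness check guards against calling top() when `index` is empty.
    while (!queue.empty() && queue.size() == index.size()) {
        int nmin = -queue.top().first;
        size_t i = queue.top().second.first;
        size_t j = queue.top().second.second;
        queue.pop();
        // [nmin, nmax] contains every word; keep the shortest such window.
        result = min(result, nmax - nmin + 1);
        // Replace the popped entry with the next occurrence of the same word.
        j++;
        if (j < index[i].size()) {
            int next = index[i][j];
            nmax = max(nmax, next);
            queue.push(make_pair(-next, make_pair(i, j)));
        }
    }
    // Still numeric_limits<int>::max() if some word never occurs.
    return result;
}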

int main() {
    int data[][3] = {{1, 4, 5}, {4, 9, 10}, {5, 6, 15}};
    vector<vector<int> > index;
    for (int i = 0; i < 3; i++) {
        index.push_back(vector<int>(data[i], data[i] + 3));
    }
    assert(snippet(index) == 2);
} 

Kaue Silveira answered Nov 19 '22 17:11