I've been going through Skiena's excellent "The Algorithm Design Manual" and got hung up on one of the exercises.
The question is: "Given a search string of three words, find the smallest snippet of the document that contains all three of the search words—i.e. , the snippet with smallest number of words in it. You are given the index positions where these words in occur search strings, such as word1: (1, 4, 5), word2: (4, 9, 10), and word3: (5, 6, 15). Each of the lists are in sorted order, as above."
Anything I come up with is O(n^2)... This question is in the "Sorting and Searching" chapter, so I assume there is a simple and clever way to do it. I'm trying something with graphs right now, but that seems like overkill.
Ideas? Thanks
Unless I've overlooked something, here's a simple, O(n) algorithm:
Running time - in each iteration either x or y is increased by 1, clearly x can't exceed y and y can't exceed string length so total number of iterations is O(n). Also, feasibility can be checked at O(1) in this case since we can track how many occurences of each word are within the current snippet. We can maintain this count at O(1) with each increase of x or y by 1.
Correctness - For each x, we calculate the minimal feasible snippet (x, ?). Thus we must go over the minimal snippet. Also, if y is the smallest y such that (x, y) is feasible then if (x+1, y') is a feasible snippet y' >= y (This bit is why this algorithm is linear and the others aren't).
I already posted a rather straightforward algorithm that solves exactly that problem in this answer
Google search results: How to find the minimum window that contains all the search keywords?
However, in that question we assumed that the input is represented by a text stream and the words are stored in an easily searchable set.
In your case the input is represented slightly differently: as a bunch of vectors with sorted positions for each word. This representation is easily transformable to what is needed for the above algorithm by simply merging all these vectors into a single vector of (position, word)
pairs ordered by position. It can be done literally, or it can be done "virtually", by placing the original vectors into the priority queue (ordered in accordance with their first elements). Popping an element from the queue in this case means popping the first element from the first vector in the queue and possibly sinking the first vector into the queue in accordance with its new first element.
Of course, since your statement of the problem explicitly fixes the number of words as three, you can simply check the first elements of all three arrays and pop the smallest one at each iteration. That gives you a O(N)
algorithm, where N
is the total length of all arrays.
Also, your statement of the problem seems to suggest that target words can overlap in the text, which is rather strange (given that you use the term "word"). Is it intentional? In any case, it doesn't present any problem for the above linked algorithm.
From the question, it seems that you're given the index locations for each of your n “search words” (word1, word2, word3, ..., word n) in the document. Using a sorting algorithm, the n independent arrays associated with search words can readily be represented as a single array of all the index locations in ascending numerical order and a word label associated with each index in the array (the index array).
The Basic Algorithm:
(Designed to work whether or not the poster of this question intended to allow two different search words to coexist at the same index number.)
First, we define a simple function for measuring the length of a snippet that contains all n labels given a starting point in the index array. (It is obvious from the definition of our array that any starting point on the array will necessarily be the indexed location of one of the n search labels.) The function simply keeps track of the unique search labels seen as the function iterates through the elements in the array until all n labels have been observed. The length of the snippet is defined as the difference between the index of the last unique label found and the index of the starting point in the index array (the first unique label found). If all n labels aren't observed before the end of the array the function returns a null value.
Now, the snippet length function can be run for each element in your array to associate a snippet size containing all n search words starting from each element in the array. The smallest non-Null value returned by the snippet length function over the whole index array is the snippet in your document that you're looking for.
Necessary Optimizations:
Computational Complexity:
Obviously the sorting part of the algorithm can be arranged in O(n log n).
Here's how I would work out the time complexity of the second part of the algorithm (any critiques and corrections would be greatly appreciated).
In the best case scenario, the algorithm only applies the snippet length function to the first element in the index array and finds that no snippet containing all the search words exists. This scenario would be computed in just n calculations where n is the size of the index array. Slightly worse than that is if the smallest snippet turns out to be equal to the size of the whole array. In this case the computational complexity will be a little less than 2 n (once through the array to find the smallest snippet length, a second time to demonstrate that no other snippets exist). The shorter the average computed snippet length, the more times the snippet length function will need to be applied over the index array. We can assume that our worse case scenario will be the case where the snippet length function needs to be applied to every element in the index array. To develop a case where the function will be applied to every element in the index array we need to design an index array where the average snippet length over the whole index array is negligible in comparison to the size of the index array as a whole. Using this case we can write out our computational complexity as O(C n) where C is some constant that is significantly smaller then n. Giving a final computational complexity of:
O(n log n + C n)
Where:
C << n
Edit:
AndreyT correctly points out that instead of sorting the word indicies in n log n time, one might just as well merge them (since the sub arrays are already sorted) in n log m time where m is the amount of search word arrays to be merged. This will obviously speed up the algorithm is cases where m < n.
O(n log k) solution, where n is the total number of indices and k is the number of words. The idea is to use a heap to identify the smallest index at each iteration, while also keeping track of the maximum index in the heap. I also put the coordinates of each value in the heap, in order to be able to retrieve the next value in constant time.
#include <algorithm>
#include <cassert>
#include <limits>
#include <queue>
#include <vector>
using namespace std;
int snippet(const vector< vector<int> >& index) {
// (-index[i][j], (i, j))
priority_queue< pair< int, pair<size_t, size_t> > > queue;
int nmax = numeric_limits<int>::min();
for (size_t i = 0; i < index.size(); ++i) {
if (!index[i].empty()) {
int cur = index[i][0];
nmax = max(nmax, cur);
queue.push(make_pair(-cur, make_pair(i, 0)));
}
}
int result = numeric_limits<int>::max();
while (queue.size() == index.size()) {
int nmin = -queue.top().first;
size_t i = queue.top().second.first;
size_t j = queue.top().second.second;
queue.pop();
result = min(result, nmax - nmin + 1);
j++;
if (j < index[i].size()) {
int next = index[i][j];
nmax = max(nmax, next);
queue.push(make_pair(-next, make_pair(i, j)));
}
}
return result;
}
int main() {
int data[][3] = {{1, 4, 5}, {4, 9, 10}, {5, 6, 15}};
vector<vector<int> > index;
for (int i = 0; i < 3; i++) {
index.push_back(vector<int>(data[i], data[i] + 3));
}
assert(snippet(index) == 2);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With