Google search results: How to find the minimum window that contains all the search keywords?

Tags:

algorithm

What is the complexity of the algorithm that is used to find the smallest snippet containing all the search keywords?

asked Apr 29 '10 by Boolean

3 Answers

I think the solution proposed by AndreyT assumes no duplicates exist in the keywords/search terms. Also, the current block can get as big as the text itself if the text contains a lot of duplicate keywords. For example: Text: 'ABBBBBBBBBB', Keywords: 'AB', Current Block: 'ABBBBBBBBBB'.

Anyway, I have implemented it in C# and did some basic testing; it would be nice to get some feedback on whether it works or not :)

    // Requires: using System; using System.Collections.Generic; using System.Linq;
    static string FindMinWindow(string text, string searchTerms)
    {
        // Tracks which search terms have been seen at least once so far.
        Dictionary<char, bool> searchIndex = new Dictionary<char, bool>();
        foreach (var item in searchTerms)
        {
            searchIndex[item] = false;   // indexer instead of Add, so duplicate terms don't throw
        }

        // The current block: matched characters and the positions where they were found.
        Queue<Tuple<char, int>> currentBlock = new Queue<Tuple<char, int>>();
        int noOfMatches = 0;
        int minLength = Int32.MaxValue;
        int startIndex = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char item = text[i];
            if (searchIndex.ContainsKey(item))
            {
                if (!searchIndex[item])
                {
                    noOfMatches++;
                }

                searchIndex[item] = true;
                var newEntry = new Tuple<char, int>(item, i);
                currentBlock.Enqueue(newEntry);

                // Normalization step: the first element of the block must occur in it exactly once.
                while (currentBlock.Count(o => o.Item1.Equals(currentBlock.First().Item1)) > 1)
                {
                    currentBlock.Dequeue();
                }

                // Figuring out the minimum length once every distinct term has been matched.
                if (noOfMatches == searchIndex.Count)
                {
                    var length = currentBlock.Last().Item2 - currentBlock.First().Item2 + 1;
                    if (length < minLength)
                    {
                        startIndex = currentBlock.First().Item2;
                        minLength = length;
                    }
                }
            }
        }
        return noOfMatches == searchIndex.Count ? text.Substring(startIndex, minLength) : String.Empty;
    }
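
As a quick sanity check (this little harness is mine, not part of the answer, and assumes the method lives in some class with the usual usings):

    // For the text "CxxxAxxxBxxAxxCxBAxxxC" and keywords A, B, C this prints "CxBA".
    Console.WriteLine(FindMinWindow("CxxxAxxxBxxAxxCxBAxxxC", "ABC"));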
answered Nov 06 '22 by Raquibul Hasan Jewel


As stated, the problem is solved by a rather simple algorithm:

Just look through the input text sequentially from the very beginning and check each word: whether it is in the search key or not. If the word is in the key, add it to the end of the structure that we will call The Current Block. The Current Block is just a linear sequence of words, each word accompanied by the position at which it was found in the text.

The Current Block must maintain the following Property: the very first word in The Current Block must be present in The Current Block once and only once. If you add a new word to the end of The Current Block and the above Property becomes violated, you have to remove the very first word from the block. This process is called normalization of The Current Block.

Normalization is a potentially iterative process, since once you remove the very first word from the block, the new first word might also violate The Property, so you'll have to remove it as well. And so on.

So, basically The Current Block is a FIFO sequence: the new words arrive at the right end, and get removed by normalization process from the left end.

All you have to do to solve the problem is look through the text, maintain The Current Block, normalizing it when necessary so that it satisfies The Property. The shortest block with all the keywords in it you ever build is the answer to the problem.

For example, consider the text

CxxxAxxxBxxAxxCxBAxxxC

with keywords A, B and C. Looking through the text you'll build the following sequence of blocks

C
CA
CAB - all words, length 9 (CxxxAxxxB...)
CABA - all words, length 12 (CxxxAxxxBxxA...)
CABAC - violates The Property, remove first C
ABAC - violates The Property, remove first A
BAC - all words, length 7 (...BxxAxxC...)
BACB - violates The Property, remove first B
ACB - all words, length 6 (...AxxCxB...)
ACBA - violates The Property, remove first A
CBA - all words, length 4 (...CxBA...)
CBAC - violates The Property, remove first C
BAC - all words, length 6 (...BAxxxC)

The best block we built has length 4, which is the answer in this case:

CxxxAxxxBxxAxx CxBA xxxC

The exact complexity of this algorithm depends on the input, since the input dictates how many iterations the normalization process will make. Ignoring normalization, the complexity is trivially O(N * log M), where N is the number of words in the text, M is the number of keywords, and O(log M) is the cost of checking whether the current word belongs to the keyword set.
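
Not from the answer itself, but a minimal word-level C# sketch of this block-and-normalize idea might look like the following (the names are illustrative; per-word counts are kept so normalization does not have to rescan the block):

    // Requires: using System; using System.Collections.Generic;
    // Returns the (start, end) word positions of the shortest block, or null if
    // some keyword never appears in the text.
    static Tuple<int, int> FindShortestBlock(string[] words, HashSet<string> keywords)
    {
        var block = new Queue<Tuple<string, int>>();   // The Current Block: (word, position)
        var counts = new Dictionary<string, int>();    // keyword -> occurrences inside the block
        Tuple<int, int> best = null;

        for (int i = 0; i < words.Length; i++)
        {
            if (!keywords.Contains(words[i])) continue;   // ignore non-keywords

            block.Enqueue(Tuple.Create(words[i], i));
            counts[words[i]] = counts.ContainsKey(words[i]) ? counts[words[i]] + 1 : 1;

            // Normalization: the first word of the block must occur in it exactly once.
            while (counts[block.Peek().Item1] > 1)
            {
                counts[block.Peek().Item1]--;
                block.Dequeue();
            }

            // Every keyword present? See whether this block is the shortest so far.
            if (counts.Count == keywords.Count)
            {
                int start = block.Peek().Item2, end = i;
                if (best == null || end - start < best.Item2 - best.Item1)
                    best = Tuple.Create(start, end);
            }
        }
        return best;
    }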

Now, having said that, I have to admit that I suspect that this might not be what you need. Since you mentioned Google in the caption, it might be that the statement of the problem you gave in your post is not complete. Maybe in your case the text is indexed? (With indexing the above algorithm is still applicable, just becomes more efficient). Maybe there's some tricky database that describes the text and allows for a more efficient solution (like without looking through the entire text)? I can only guess and you are not saying...

answered Nov 06 '22 by AnT


This is an interesting question. To restate it more formally: Given a list L (the web page) of length n and a set S (the query) of size k, find the smallest sublist of L that contains all the elements of S.

I'll start with a brute-force solution in hopes of inspiring others to beat it. Note that set membership can be done in constant time, after one pass through the set. See this question. Also note that this assumes all the elements of S are in fact in L, otherwise it will just return the sublist from 1 to n.

best = (1,n)
For i from 1 to n-k+1:
  Create/reset a hash found[] mapping each element of S to False; let counter = 0.
  For j from i to n or until counter == k:
    If L[j] is in S and not found[L[j]] then counter++ and let found[L[j]] = True;
  If counter == k and j-i < best[2]-best[1] then let best = (i,j).

Time complexity is O((n+k)(n-k)), i.e., n^2-ish.
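
For concreteness, a rough C# transcription of the brute force above might look like this (identifiers are mine, not from the answer):

    // Requires: using System; using System.Collections.Generic;
    // Returns the (start, end) indices of the smallest window of L containing all of S,
    // or null if L does not contain every element of S.
    static Tuple<int, int> BruteForceWindow(string[] L, HashSet<string> S)
    {
        int n = L.Length, k = S.Count;
        Tuple<int, int> best = null;

        for (int i = 0; i <= n - k; i++)              // every candidate start position
        {
            var found = new HashSet<string>();        // elements of S seen in L[i..j-1]
            int j = i;
            for (; j < n && found.Count < k; j++)     // extend until all of S is covered
            {
                if (S.Contains(L[j])) found.Add(L[j]);
            }
            if (found.Count == k && (best == null || (j - 1) - i < best.Item2 - best.Item1))
                best = Tuple.Create(i, j - 1);        // window is L[i..j-1]
        }
        return best;
    }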

answered Nov 06 '22 by dreeves