Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to find top 10 search terms

People also ask

What is the best searching algorithm?

Binary search algorithm works on the principle of divide & conquer and it is considered the best searching algorithms because of its faster speed to search ( Provided the data is in sorted form). A binary search is also known as a half-interval search or logarithmic search.

What is the search algorithm of Google?

The Google Search Algorithm refers to the process Google uses to rank content. It takes into account hundreds of factors, including keyword mentions, usability, and backlinks. Sidenote. Google has multiple search algorithms all working together to return the best results.


Frequency Estimation Overview

There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it find all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.

The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency by (n / k) items.

Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).

Instead, these frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).

The Frequent Algorithm

Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.

Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.

  1. Look at each item in the stream.
  2. If the term exists in the map, increment the associated counter.
  3. Otherwise, if the map less than k - 1 entries, add the term to the map with a counter of one.
  4. However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.

Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.

Counting Searches

In this context, perhaps you buffer one hour of searches, and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or, maybe its okay to to make a single pass, and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.

Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.


Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.

Useful book in this area: Muthukrishnan - "Data Streams: Algorithms and Applications"

Closely related reference to the problem at hand which I picked from the above: Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]

By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.

Heh, nice interview question.


This is one of the research project that I am current going through. The requirement is almost exactly as yours, and we have developed nice algorithms to solve the problem.

The Input

The input is an endless stream of English words or phrases (we refer them as tokens).

The Output

  1. Output top N tokens we have seen so far (from all the tokens we have seen!)
  2. Output top N tokens in a historical window, say, last day or last week.

An application of this research is to find the hot topic or trends of topic in Twitter or Facebook. We have a crawler that crawls on the website, which generates a stream of words, which will feed into the system. The system then will output the words or phrases of top frequency either at overall or historically. Imagine in last couple of weeks the phrase "World Cup" would appears many times in Twitter. So does "Paul the octopus". :)

String into Integers

The system has an integer ID for each word. Though there is almost infinite possible words on the Internet, but after accumulating a large set of words, the possibility of finding new words becomes lower and lower. We have already found 4 million different words, and assigned a unique ID for each. This whole set of data can be loaded into memory as a hash table, consuming roughly 300MB memory. (We have implemented our own hash table. The Java's implementation takes huge memory overhead)

Each phrase then can be identified as an array of integers.

This is important, because sorting and comparisons on integers is much much faster than on strings.

Archive Data

The system keeps archive data for every token. Basically it's pairs of (Token, Frequency). However, the table that stores the data would be so huge such that we have to partition the table physically. Once partition scheme is based on ngrams of the token. If the token is a single word, it is 1gram. If the token is two-word phrase, it is 2gram. And this goes on. Roughly at 4gram we have 1 billion records, with table sized at around 60GB.

Processing Incoming Streams

The system will absorbs incoming sentences until memory becomes fully utilized (Ya, we need a MemoryManager). After taking N sentences and storing in memory, the system pauses, and starts tokenize each sentence into words and phrases. Each token (word or phrase) is counted.

For highly frequent tokens, they are always kept in memory. For less frequent tokens, they are sorted based on IDs (remember we translate the String into an array of integers), and serialized into a disk file.

(However, for your problem, since you are counting only words, then you can put all word-frequency map in memory only. A carefully designed data structure would consume only 300MB memory for 4 million different words. Some hint: use ASCII char to represent Strings), and this is much acceptable.

Meanwhile, there will be another process that is activated once it finds any disk file generated by the system, then start merging it. Since the disk file is sorted, merging would take a similar process like merge sort. Some design need to be taken care at here as well, since we want to avoid too many random disk seeks. The idea is to avoid read (merge process)/write (system output) at the same time, and let the merge process read form one disk while writing into a different disk. This is similar like to implementing a locking.

End of Day

At end of day, the system will have many frequent tokens with frequency stored in memory, and many other less frequent tokens stored in several disk files (and each file is sorted).

The system flush the in-memory map into a disk file (sort it). Now, the problem becomes merging a set of sorted disk file. Using similar process, we would get one sorted disk file at the end.

Then, the final task is to merge the sorted disk file into archive database. Depends on the size of archive database, the algorithm works like below if it is big enough:

   for each record in sorted disk file
        update archive database by increasing frequency
        if rowcount == 0 then put the record into a list
   end for

   for each record in the list of having rowcount == 0
        insert into archive database
   end for

The intuition is that after sometime, the number of inserting will become smaller and smaller. More and more operation will be on updating only. And this updating will not be penalized by index.

Hope this entire explanation would help. :)


You could use a hash table combined with a binary search tree. Implement a <search term, count> dictionary which tells you how many times each search term has been searched for.

Obviously iterating the entire hash table every hour to get the top 10 is very bad. But this is google we're talking about, so you can assume that the top ten will all get, say over 10 000 hits (it's probably a much larger number though). So every time a search term's count exceeds 10 000, insert it in the BST. Then every hour, you only have to get the first 10 from the BST, which should contain relatively few entries.

This solves the problem of top-10-of-all-time.


The really tricky part is dealing with one term taking another's place in the monthly report (for example, "stack overflow" might have 50 000 hits for the past two months, but only 10 000 the past month, while "amazon" might have 40 000 for the past two months but 30 000 for the past month. You want "amazon" to come before "stack overflow" in your monthly report). To do this, I would store, for all major (above 10 000 all-time searches) search terms, a 30-day list that tells you how many times that term was searched for on each day. The list would work like a FIFO queue: you remove the first day and insert a new one each day (or each hour, but then you might need to store more information, which means more memory / space. If memory is not a problem do it, otherwise go for that "approximation" they're talking about).

This looks like a good start. You can then worry about pruning the terms that have > 10 000 hits but haven't had many in a long while and stuff like that.


case i)

Maintain a hashtable for all the searchterms, as well as a sorted top-ten list separate from the hashtable. Whenever a search occurs, increment the appropriate item in the hashtable and check to see if that item should now be switched with the 10th item in the top-ten list.

O(1) lookup for the top-ten list, and max O(log(n)) insertion into the hashtable (assuming collisions managed by a self-balancing binary tree).

case ii) Instead of maintaining a huge hashtable and a small list, we maintain a hashtable and a sorted list of all items. Whenever a search is made, that term is incremented in the hashtable, and in the sorted list the term can be checked to see if it should switch with the term after it. A self-balancing binary tree could work well for this, as we also need to be able to query it quickly (more on this later).

In addition we also maintain a list of 'hours' in the form of a FIFO list (queue). Each 'hour' element would contain a list of all searches done within that particular hour. So for example, our list of hours might look like this:

Time: 0 hours
      -Search Terms:
          -free stuff: 56
          -funny pics: 321
          -stackoverflow: 1234
Time: 1 hour
      -Search Terms:
          -ebay: 12
          -funny pics: 1
          -stackoverflow: 522
          -BP sucks: 92

Then, every hour: If the list has at least 720 hours long (that's the number of hours in 30 days), look at the first element in the list, and for each search term, decrement that element in the hashtable by the appropriate amount. Afterwards, delete that first hour element from the list.

So let's say we're at hour 721, and we're ready to look at the first hour in our list (above). We'd decrement free stuff by 56 in the hashtable, funny pics by 321, etc., and would then remove hour 0 from the list completely since we will never need to look at it again.

The reason we maintain a sorted list of all terms that allows for fast queries is because every hour after as we go through the search terms from 720 hours ago, we need to ensure the top-ten list remains sorted. So as we decrement 'free stuff' by 56 in the hashtable for example, we'd check to see where it now belongs in the list. Because it's a self-balancing binary tree, all of that can be accomplished nicely in O(log(n)) time.


Edit: Sacrificing accuracy for space...

It might be useful to also implement a big list in the first one, as in the second one. Then we could apply the following space optimization on both cases: Run a cron job to remove all but the top x items in the list. This would keep the space requirement down (and as a result make queries on the list faster). Of course, it would result in an approximate result, but this is allowed. x could be calculated before deploying the application based on available memory, and adjusted dynamically if more memory becomes available.


Rough thinking...

For top 10 all time

  • Using a hash collection where a count for each term is stored (sanitize terms, etc.)
  • An sorted array which contains the ongoing top 10, a term/count in added to this array whenever the count of a term becomes equal or greater than the smallest count in the array

For monthly top 10 updated hourly:

  • Using an array indexed on number of hours elapsed since start modulo 744 (the number of hours during a month), which array entries consist of hash collection where a count for each term encountered during this hour-slot is stored. An entry is reset whenever the hour-slot counter changes
  • the stats in the array indexed on hour-slot needs to be collected whenever the current hour-slot counter changes (once an hour at most), by copying and flattening the content of this array indexed on hour-slots

Errr... make sense? I didn't think this through as I would in real life

Ah yes, forgot to mention, the hourly "copying/flattening" required for the monthly stats can actually reuse the same code used for the top 10 of all time, a nice side effect.


Exact solution

First, a solution that guarantees correct results, but requires a lot of memory (a big map).

"All-time" variant

Maintain a hash map with queries as keys and their counts as values. Additionally, keep a list f 10 most frequent queries so far and the count of the 10th most frequent count (a threshold).

Constantly update the map as the stream of queries is read. Every time a count exceeds the current threshold, do the following: remove the 10th query from the "Top 10" list, replace it with the query you've just updated, and update the threshold as well.

"Past month" variant

Keep the same "Top 10" list and update it the same way as above. Also, keep a similar map, but this time store vectors of 30*24 = 720 count (one for each hour) as values. Every hour do the following for every key: remove the oldest counter from the vector add a new one (initialized to 0) at the end. Remove the key from the map if the vector is all-zero. Also, every hour you have to calculate the "Top 10" list from scratch.

Note: Yes, this time we're storing 720 integers instead of one, but there are much less keys (the all-time variant has a really long tail).

Approximations

These approximations do not guarantee the correct solution, but are less memory-consuming.

  1. Process every N-th query, skipping the rest.
  2. (For all-time variant only) Keep at most M key-value pairs in the map (M should be as big as you can afford). It's a kind of an LRU cache: every time you read a query that is not in the map, remove the least recently used query with count 1 and replace it with the currently processed query.