Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.

My thinking is like this.

  1. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

To summarize, this solution cost time O(n+k*lg(n)).

This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.

like image 779
Morgan Cheng Avatar asked Oct 09 '08 02:10

Morgan Cheng


People also ask

How would you find the K most frequent words from a string words?

Hash all words one by one in a hash table. If a word is already present, then increment its count. Finally, traverse through the hash table and return the k words with maximum counts. We can use Trie and Min Heap to get the k most frequent words efficiently.

How do I find the most frequent words in a file?

This can be done by opening a file in read mode using file pointer. Read the file line by line. Split a line at a time and store in an array. Iterate through the array and find the frequency of each word and compare the frequency with maxcount.


1 Answers

This can be done in O(n) time

Solution 1:

Steps:

  1. Count words and hash it, which will end up in the structure like this

    var hash = {   "I" : 13,   "like" : 3,   "meow" : 3,   "geek" : 3,   "burger" : 2,   "cat" : 1,   "foo" : 100,   ...   ... 
  2. Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size

  3. Then we can traverse the hash again and use the number of occurrences of words as array index, if there is nothing in the index, create an array else append it in the array. Then we end up with an array like:

      0   1      2            3                  100 [[ ],[cat],[burger],[like, meow, geek],[]...[foo]] 
  4. Then just traverse the array from the end, and collect the k words.

Solution 2:

Steps:

  1. Same as above
  2. Use min heap and keep the size of min heap to k, and for each word in the hash we compare the occurrences of words with the min, 1) if it's greater than the min value, remove the min (if the size of the min heap is equal to k) and insert the number in the min heap. 2) rest simple conditions.
  3. After traversing through the array, we just convert the min heap to array and return the array.
like image 60
Chihung Yu Avatar answered Oct 05 '22 12:10

Chihung Yu