The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

Tags: algorithm, word-frequency

Input: A positive integer K and a large text. The text can be treated as a word sequence, so we don't have to worry about how to split it into words.
Output: The K most frequent words in the text.

My thinking is like this:

1. Use a hash table to record each word's frequency while traversing the whole word sequence. In this phase, the key is the word and the value is its frequency. This takes O(n) time.
2. Sort the (word, word-frequency) pairs, using word-frequency as the sort key. This takes O(n*lg(n)) time with a normal sorting algorithm.
3. After sorting, just take the first K words. This takes O(K) time.

To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than n, this is actually O(n*lg(n)).
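The three steps can be sketched in JavaScript roughly as follows. This is only an illustration: the function name `topKBySort` and the choice of a `Map` as the hash table are assumptions, not part of the question.

```javascript
// Hypothetical helper: count words, sort by frequency, take the first k.
function topKBySort(words, k) {
  // Step 1: count every word in a single pass, O(n).
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Step 2: sort the (word, frequency) pairs by descending frequency.
  const pairs = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  // Step 3: keep only the first k words.
  return pairs.slice(0, k).map(([word]) => word);
}

// Usage: "foo" occurs 3 times, "bar" twice, "baz" once.
const sortSample = "foo bar foo baz foo bar".split(" ");
console.log(topKBySort(sortSample, 2));
```

For this sample the two returned words are "foo" and "bar", in that order.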

We can improve this. We only want the top K words; the frequencies of the other words don't matter to us. So we can use a "partial heap sort". Instead of fully sorting in steps 2) and 3), we change them to:

2') Build a heap of (word, word-frequency) pairs with word-frequency as the key. Building a heap takes O(n) time.

3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time for this step is O(K*lg(n)).

To summarize, this solution costs O(n + K*lg(n)) time.
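A minimal sketch of steps 2') and 3') might look like the code below. JavaScript has no built-in heap, so the `siftDown` helper and the `topKByHeap` name are hypothetical, written only for this illustration; the heap is a plain array of `[word, frequency]` pairs.

```javascript
// Hypothetical helper: restore the max-heap property below index i.
// Entries are [word, frequency] pairs and the key is the frequency.
function siftDown(heap, i, size) {
  while (true) {
    const left = 2 * i + 1;
    const right = 2 * i + 2;
    let largest = i;
    if (left < size && heap[left][1] > heap[largest][1]) largest = left;
    if (right < size && heap[right][1] > heap[largest][1]) largest = right;
    if (largest === i) return;
    [heap[i], heap[largest]] = [heap[largest], heap[i]];
    i = largest;
  }
}

function topKByHeap(words, k) {
  // Step 1: count words, O(n).
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Step 2': bottom-up heap construction, linear in the number of pairs.
  const heap = [...counts.entries()];
  let size = heap.length;
  for (let i = Math.floor(size / 2) - 1; i >= 0; i--) {
    siftDown(heap, i, size);
  }
  // Step 3': remove the maximum up to k times, O(lg(n)) per removal.
  const result = [];
  while (result.length < k && size > 0) {
    result.push(heap[0][0]);
    size--;
    heap[0] = heap[size];
    siftDown(heap, 0, size);
  }
  return result;
}

console.log(topKByHeap("a b c a b a d".split(" "), 2));
```

If k is larger than the number of distinct words, the sketch simply returns every word it has.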

This is just my thought. I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.


asked Oct 09 '08 02:10

Morgan Cheng

1 Answer

This can be done in O(n) time.

Solution 1:

Steps:

1. Count the words with a hash, which ends up as a structure like this:

```
var hash = {
  "I" : 13,
  "like" : 3,
  "meow" : 3,
  "geek" : 3,
  "burger" : 2,
  "cat" : 1,
  "foo" : 100,
  ...
  ...
};
```

2. Traverse the hash to find the highest count (in this case 100, for "foo"), then create an array with that count plus one slots, so that indices 0 through 100 are all valid.
3. Traverse the hash again and use each word's number of occurrences as the array index. If there is nothing at that index yet, create an array there; otherwise append the word to the existing one. We end up with an array like:

```
  0    1      2             3                100
[[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
```

4. Then traverse the array from the end and collect words until we have k.

No word can occur more than n times, so the array has at most n + 1 slots and every traversal is O(n).
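Solution 1 could be sketched like this. The names `topKByBuckets`, `buckets` and `maxCount` are made up for the illustration, and a `Map` stands in for the hash object shown above.

```javascript
// Hypothetical helper illustrating the bucket approach of Solution 1.
function topKByBuckets(words, k) {
  // Step 1: count the words.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Step 2: find the highest count and allocate maxCount + 1 buckets,
  // so that index maxCount itself is valid.
  let maxCount = 0;
  for (const count of counts.values()) {
    if (count > maxCount) maxCount = count;
  }
  const buckets = Array.from({ length: maxCount + 1 }, () => []);
  // Step 3: buckets[c] collects every word that occurs exactly c times.
  for (const [word, count] of counts) {
    buckets[count].push(word);
  }
  // Step 4: walk the buckets from the end until k words are collected.
  const result = [];
  for (let c = maxCount; c >= 1 && result.length < k; c--) {
    for (const word of buckets[c]) {
      if (result.length === k) break;
      result.push(word);
    }
  }
  return result;
}

console.log(topKByBuckets("foo cat foo burger burger foo".split(" "), 2));
```

Words that share a count come out in the order they were first seen, because a `Map` iterates in insertion order; the answer itself does not specify any tie-breaking rule.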

Solution 2:

Steps:

1. Same as step 1 of Solution 1.
2. Use a min-heap keyed on the number of occurrences and keep its size at most k. For each word in the hash: if the heap holds fewer than k words, insert the word; otherwise compare its count with the heap's minimum and, if it is greater, remove the minimum and insert the word. In all other cases, skip the word. Each heap operation costs O(lg(k)), so this step is O(n*lg(k)) in the worst case.
3. After traversing the hash, convert the min-heap to an array and return it.
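A rough sketch of Solution 2 follows. Since JavaScript ships no heap, the small `MinHeap` class is a hypothetical stand-in written for this example, as is the name `topKByMinHeap`. The final sort by descending count is an extra added for readability; the answer only asks for the heap to be converted to an array.

```javascript
// Hypothetical minimal min-heap of [word, frequency] pairs, keyed on frequency.
class MinHeap {
  constructor() { this.items = []; }
  get size() { return this.items.length; }
  peek() { return this.items[0]; }
  push(pair) {
    const a = this.items;
    a.push(pair);
    let i = a.length - 1;
    // Sift up while the new pair is smaller than its parent.
    while (i > 0) {
      const parent = Math.floor((i - 1) / 2);
      if (a[parent][1] <= a[i][1]) break;
      [a[parent], a[i]] = [a[i], a[parent]];
      i = parent;
    }
  }
  pop() {
    const a = this.items;
    const top = a[0];
    const last = a.pop();
    if (a.length > 0) {
      a[0] = last;
      let i = 0;
      // Sift down while one of the children is smaller.
      while (true) {
        const left = 2 * i + 1;
        const right = 2 * i + 2;
        let smallest = i;
        if (left < a.length && a[left][1] < a[smallest][1]) smallest = left;
        if (right < a.length && a[right][1] < a[smallest][1]) smallest = right;
        if (smallest === i) break;
        [a[i], a[smallest]] = [a[smallest], a[i]];
        i = smallest;
      }
    }
    return top;
  }
}

function topKByMinHeap(words, k) {
  // Step 1: count the words.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Step 2: keep at most k pairs; the heap root is the weakest candidate.
  const heap = new MinHeap();
  for (const pair of counts) {
    if (heap.size < k) {
      heap.push(pair);
    } else if (k > 0 && pair[1] > heap.peek()[1]) {
      heap.pop();
      heap.push(pair);
    }
  }
  // Step 3: convert the heap to an array, most frequent word first.
  return heap.items
    .slice()
    .sort((x, y) => y[1] - x[1])
    .map(([word]) => word);
}

console.log(topKByMinHeap("a b c a b a d d d d d".split(" "), 2));
```

Compared with the bucket approach, this variant needs only O(k) extra space beyond the hash, which may matter when k is small and the vocabulary is large.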

answered Oct 05 '22 12:10

Chihung Yu
