
Finding k most common words in a file - memory usage

Let's say you are given a huge file, e.g. 1GB. The file contains one word per line (n words in total), and you want to find the k most frequent words in the file.

Now, assuming you have enough memory to store these words, what is the better approach in terms of memory usage and the constant factors hidden in the Big-O complexity? I believe there are two basic algorithms one can use:

  1. Use a hash table and a min-heap to store the occurrences and the top k words seen so far. This is O(n + n log k) ≈ O(n). (A sketch of this approach follows the list.)
  2. Use a trie to store the words and their occurrence counts, then traverse the trie to find the most frequent words. This is O(n·p) ≈ O(n), where p is the length of the longest word.
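
For concreteness, here is a minimal sketch of approach 1 in Python, assuming the one-word-per-line format described above (the function name and the use of collections.Counter and heapq are illustrative choices, not part of the question):

```python
# Sketch of approach 1: hash table (Counter) for occurrences + size-k min-heap.
# heapq is a min-heap, so the least frequent of the current top-k candidates
# sits at the root and can be evicted in O(log k).
import heapq
from collections import Counter

def top_k_words(path, k):
    counts = Counter()
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:
                counts[word] += 1           # O(1) expected per update

    heap = []                               # min-heap of (count, word), size <= k
    for word, count in counts.items():
        heapq.heappush(heap, (count, word))
        if len(heap) > k:
            heapq.heappop(heap)             # drop the least frequent candidate
    return sorted(heap, reverse=True)       # most frequent first
```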

Which is a better approach?

Also: if you didn't have enough memory for a hash table/trie (i.e. limited memory of 10MB or so), then what is the best approach?

asked Dec 21 '12 by user1921187



1 Answer

Which is more efficient in terms of the constant factors depends heavily on the situation. On one hand, a trie offers a strict O(n) bound for inserting all elements, while a hash table might degrade to quadratic time in the worst case.
On the other hand, tries are not very cache-friendly: each lookup requires O(|S|) random memory accesses, where |S| is the length of the word, which can hurt performance significantly.
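
To make the cache argument concrete, here is a minimal counting-trie sketch (the class and function names are illustrative): inserting a word of length |S| follows |S| child pointers, and each hop is a separate allocation that may miss the cache.

```python
# Minimal counting trie: dict-based children, per-node occurrence count.
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # occurrences of the word ending at this node

def insert(root, word):
    node = root
    for ch in word:          # one pointer chase (and possible cache miss) per char
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def iter_counts(node, prefix=""):
    # Depth-first traversal yielding (word, count) pairs.
    if node.count:
        yield prefix, node.count
    for ch, child in node.children.items():
        yield from iter_counts(child, prefix + ch)
```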

Both approaches are valid, and I think there are multiple considerations that should be taken into account when choosing one over the other, such as maximum latency (if it is a real-time system), throughput, and development time.

If average-case performance is all that matters, I'd suggest generating a bunch of files and running a statistical analysis to see which approach is better. The Wilcoxon signed-rank test is the de facto standard hypothesis test for this kind of paired comparison.
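
As an illustration of such an analysis (the runtimes below are made-up placeholders, not measurements), scipy's wilcoxon can run the signed-rank test on paired per-file timings of the two approaches:

```python
# Hypothetical paired runtimes (seconds) of the two approaches on the same files.
from scipy.stats import wilcoxon

hash_times = [1.92, 2.10, 1.85, 2.01, 1.97, 2.05, 1.88, 1.95]   # placeholder data
trie_times = [2.31, 2.42, 2.05, 2.38, 2.20, 2.44, 2.12, 2.29]   # placeholder data

stat, p_value = wilcoxon(hash_times, trie_times)
print(f"statistic={stat:.2f}, p-value={p_value:.4f}")
# A small p-value suggests the difference between the paired runtimes
# is unlikely to be due to chance alone.
```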


Regarding memory-constrained (embedded) systems: both approaches are still valid, but now each "node" (or group of nodes) in the trie will live on disk rather than in RAM. Note that this means the trie needs O(|S|) random-access disk seeks per entry, which can be quite slow.

For the hashing solution: you have 10MB; say 5MB of it is used for a hash table of pointers to disk, and the remaining 5MB is used to load one bucket at a time after each hash lookup. Assume (pessimistically) that the 5MB table holds only 500 disk addresses, i.e. 500 buckets. With a load factor of 0.5, that lets you store roughly 500 * 5MB * 0.5 ≈ 1.25GB > 1GB of your data. Thus, with the hash-table solution, each lookup needs only O(1) random disk seeks to find the bucket containing the relevant string.

Note that if this is still not enough, we can rehash the pointer tables into multiple levels, much like the page tables used by the virtual memory mechanism.

From this we can conclude that for memory-constrained systems the hash solution is better in most cases (note that it might still suffer from high latency in the worst case; there is no silver bullet here).
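
For the low-memory case, here is a hedged sketch of a closely related strategy (not the exact on-disk pointer table described above): partition the words into bucket files by hash so each bucket fits in RAM, count each bucket independently, and merge the per-bucket results with a size-k min-heap. All names and the bucket count are illustrative.

```python
# Partition-by-hash variant of the low-memory hashing idea: every occurrence of
# a word lands in the same bucket file, so per-bucket counts are exact global counts.
import heapq
import os
from collections import Counter

def external_top_k(path, k, num_buckets=512, tmp_dir="buckets"):
    os.makedirs(tmp_dir, exist_ok=True)
    bucket_files = [open(os.path.join(tmp_dir, f"b{i}.txt"), "w")
                    for i in range(num_buckets)]
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:
                bucket_files[hash(word) % num_buckets].write(word + "\n")
    for b in bucket_files:
        b.close()

    candidates = []                          # global min-heap of (count, word)
    for i in range(num_buckets):
        counts = Counter()                   # each bucket is small enough for RAM
        with open(os.path.join(tmp_dir, f"b{i}.txt")) as f:
            for line in f:
                counts[line.strip()] += 1
        for word, count in counts.items():
            heapq.heappush(candidates, (count, word))
            if len(candidates) > k:
                heapq.heappop(candidates)
    return sorted(candidates, reverse=True)  # most frequent first
```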


P.S. A radix tree is usually faster and more compact than a plain trie, but it suffers from the same drawbacks as a trie compared to hash tables (though to a lesser degree, of course).

answered Oct 26 '22 by amit