I came across a problem where we have to find say the most 10 frequent words in a terabyte of file or string.
One solution I could think was using a hash table (word, count) along with a max heap. But fitting all the words if the words are unique might cause a problem. I thought of another solution using Map-Reduce by splitting the chunks on different nodes. Another solution would be to build a Trie for all the words and update the count of each word as we scan through the file or string.
Which one of the above would be a better solution? I think the first solution is pretty naive.
Sort the terabyte file alphabetically using mergesort. In the initial pass, use quick sort using all available physical RAM to pre-sort long runs of words.
When doing so, represent a continuous sequence of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)
Then resort the file, again using mergesort with quick sort presorting, but this time by the counts rather than alphabetically.
This is slower but simpler to implement than my other answer.
The best I could think of:
N
most frequent words, you will get N * partsNumber
words.It won't always give you correct answer, but it will work in fixed memory and linear time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With