 

Most frequent words in a terabyte of data

I came across a problem where we have to find, say, the 10 most frequent words in a terabyte-sized file or string.

One solution I could think of is a hash table (word, count) together with a max-heap. But if most of the words are unique, fitting them all in memory could be a problem. I thought of another solution using Map-Reduce, splitting the chunks across different nodes. A third would be to build a trie over all the words and update the count of each word as we scan through the file or string.
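A minimal sketch of the hash-table-plus-heap idea in Python (assuming, for illustration, that the counts fit in one machine's memory, which is exactly the part in doubt):

    from collections import Counter
    import heapq

    def top_k_words(lines, k=10):
        counts = Counter()                       # word -> frequency (the hash table)
        for line in lines:
            counts.update(line.split())          # naive whitespace tokenization
        # heapq.nlargest keeps only k items at a time, so the heap stays small
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

    # e.g.  with open("big.txt") as f: print(top_k_words(f, 10))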

Which one of the above would be a better solution? I think the first solution is pretty naive.

asked Sep 21 '12 by Akshay

2 Answers

Sort the terabyte file alphabetically using mergesort. In the initial pass, use quicksort with all available physical RAM to pre-sort long runs of words.

While doing so, represent each contiguous run of identical words by a single copy of the word and a count. (That is, you are adding the counts during the merges.)

Then re-sort the file, again using mergesort with quicksort pre-sorting, but this time by count rather than alphabetically.
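As an illustration only (not code from the answer), here is how the count-adding merge of two sorted runs of (word, count) pairs might look in Python:

    import heapq

    def merge_runs(run_a, run_b):
        """Merge two alphabetically sorted streams of (word, count) pairs,
        summing the counts of identical words."""
        current_word, current_count = None, 0
        for word, count in heapq.merge(run_a, run_b):   # still sorted by word
            if word == current_word:
                current_count += count                  # same word: add the counts
            else:
                if current_word is not None:
                    yield current_word, current_count
                current_word, current_count = word, count
        if current_word is not None:
            yield current_word, current_count

The final pass then orders the resulting (word, count) pairs by count instead of by word, e.g. with key=lambda wc: wc[1].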

This is slower but simpler to implement than my other answer.

answered Oct 31 '22 by Jirka Hanika


The best I could think of:

  1. Split the data into parts that you can store in memory.
  2. For each part, find the N most frequent words; across all parts this gives you N * partsNumber candidate words.
  3. Read all the data again, counting only those candidate words.

It won't always give you the correct answer, but it works in fixed memory and linear time (a sketch follows below).
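A minimal sketch of this two-pass scheme in Python; read_chunks is a hypothetical callable (not from the answer) that returns a fresh iterator of word lists, each small enough to fit in memory:

    from collections import Counter
    import heapq

    def approx_top_words(read_chunks, n=10):
        # Pass 1: collect the n most frequent words of each chunk as candidates.
        candidates = set()
        for chunk in read_chunks():              # chunk: a list of words that fits in RAM
            candidates.update(w for w, _ in Counter(chunk).most_common(n))

        # Pass 2: count only the candidate words over the whole data set.
        totals = Counter()
        for chunk in read_chunks():              # re-read the data from the start
            totals.update(w for w in chunk if w in candidates)

        return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])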

answered Oct 31 '22 by Ari