Most frequent words in a terabyte of data

Question

I came across a problem where we have to find, say, the 10 most frequent words in a terabyte-sized file or string.

One solution I could think of is a hash table of (word, count) pairs combined with a max heap. But if most of the words are unique, fitting them all in memory might be a problem. I thought of another solution using Map-Reduce, splitting the chunks across different nodes. Another solution would be to build a trie of all the words and update the count of each word as we scan through the file or string.

Which of the above would be the better solution? I think the first solution is pretty naive.
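
For reference, here is a minimal sketch of the first approach, assuming the distinct words actually fit in memory (which is exactly the concern above). The function name `top_k_words` is made up for illustration, and `heapq.nlargest` is used in place of an explicit max heap, since it only needs to keep k entries at a time.

```python
import heapq
from collections import Counter

def top_k_words(words, k=10):
    """Count every word in a hash table, then select the k largest counts."""
    counts = Counter()
    for word in words:
        counts[word] += 1  # one hash table entry per distinct word
    # nlargest returns the k (word, count) pairs with the highest counts,
    # ordered from most to least frequent
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])
```

Memory use grows with the number of distinct words, not with the size of the file, so this only breaks down when the vocabulary itself is too large for RAM.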

Jirka Hanika · Accepted Answer

Sort the terabyte file alphabetically using an external mergesort. In the initial pass, use quicksort in all available physical RAM to pre-sort long runs of words.

When doing so, represent a contiguous sequence of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)

Then re-sort the file, again using mergesort with quicksort pre-sorting, but this time by the counts rather than alphabetically.

This is slower but simpler to implement than my other answer.
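
A rough sketch of the sort-and-merge idea above, under several assumptions: words contain no tabs or newlines, each run is sorted with Python's built-in sort rather than a hand-written quicksort, all runs are merged in a single pass (a real terabyte-scale job would probably need several merge passes to stay under the open-file limit), and the final re-sort by count is replaced by `heapq.nlargest`, which is enough when only the top k are wanted. All function names are made up for illustration.

```python
import heapq
import itertools
import tempfile
from operator import itemgetter

def write_run(chunk):
    """Sort one in-memory chunk and spill it to disk as 'word<TAB>count' lines."""
    chunk.sort()
    run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    # Collapse each contiguous sequence of identical words into word + count
    for word, group in itertools.groupby(chunk):
        run.write(f"{word}\t{sum(1 for _ in group)}\n")
    run.seek(0)
    return run

def read_run(run):
    """Yield (word, count) pairs from a sorted run file."""
    for line in run:
        word, count = line.rstrip("\n").split("\t")
        yield word, int(count)

def merged_counts(words, run_size):
    """Yield (word, total_count) in alphabetical order via sorted runs on disk."""
    it = iter(words)
    runs = []
    while True:
        chunk = list(itertools.islice(it, run_size))
        if not chunk:
            break
        runs.append(write_run(chunk))
    # k-way merge of the sorted runs, adding the counts of equal words
    merged = heapq.merge(*(read_run(r) for r in runs), key=itemgetter(0))
    for word, group in itertools.groupby(merged, key=itemgetter(0)):
        yield word, sum(count for _, count in group)
    for r in runs:
        r.close()

def top_k_by_sorting(words, k=10, run_size=1_000_000):
    """Assumption: selecting the top k replaces the second full sort by count."""
    return heapq.nlargest(k, merged_counts(words, run_size), key=itemgetter(1))
```

Only `run_size` words are held in memory at a time; everything else lives in the run files.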

Ari · Answer

The best I could think of:

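Split the data into parts that fit in memory.
For each part, find the N most frequent words; this yields at most N * partsNumber candidate words.
Read all the data again, counting only the candidate words found in the previous step.

It won't always give you the correct answer, but it works in fixed memory and linear time. For example, a word that is fairly common in every part but never among the top N of any single part is missed, even if it is among the most frequent overall.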
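
The three steps above could look roughly like this. It is only a sketch: `read_words` is a hypothetical zero-argument callable that returns a fresh iterator over all the words (for example by reopening the file), and `part_size` stands in for "as much as fits in memory".

```python
import itertools
from collections import Counter

def approximate_top_n(read_words, n=10, part_size=1_000_000):
    """Two-pass, fixed-memory estimate of the n most frequent words."""
    # Pass 1: take the n most frequent words of each part as candidates
    candidates = set()
    it = iter(read_words())
    while True:
        part = list(itertools.islice(it, part_size))
        if not part:
            break
        candidates.update(word for word, _ in Counter(part).most_common(n))
    # Pass 2: exact counts over all the data, but only for the candidates
    totals = Counter(word for word in read_words() if word in candidates)
    return totals.most_common(n)
```

For instance, with `data = "a b a c b a".split()`, calling `approximate_top_n(lambda: iter(data), n=2, part_size=3)` counts `a`, `b` and `c` in the second pass and returns the two most frequent of them.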

Tags: algorithm, bigdata

Akshay
