 

How to find high frequency words in a book in an environment low on memory?

Tags:

text

frequency

Recently, in a technical interview, I was asked to write a program to find the high-frequency words (the words that appear the maximum number of times) in a text book. The program should be designed so that it processes the entire text book with minimum memory. Performance is not a concern. I was able to write a program that finds the frequency of words, but it consumed a lot of memory.

How do you make this operation less memory-intensive? Any strategies/solutions?

-Snehal

Snehal asked Apr 12 '09 18:04

1 Answer

You probably used a hash table, which is memory-intensive but gives constant-time lookups, so the performance/memory trade-off is clear. By the time you reach the end of the book you know your answer, and incrementing a counter for each word is fast because of the quick hash-table lookups.
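A minimal sketch of that hash-table approach in Python; the file name "book.txt", the top-10 cutoff, and whitespace/lowercase tokenization are assumptions, not details from the question.

    from collections import Counter

    def top_words(path, n=10):
        counts = Counter()            # hash table: word -> count
        with open(path) as f:
            for line in f:            # stream the book line by line
                counts.update(line.lower().split())
        return counts.most_common(n)  # the n most frequent words

    print(top_words("book.txt"))

Memory grows with the number of distinct words, which is exactly what the question is trying to avoid.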

The other end of the spectrum is to look at the first word, then go through the entire book to count how many times that word occurs. This requires minimal memory. Then you do the same for the next word, going through the entire book again; if that word occurs more times, it becomes the new top word (or joins the top N words). Of course, this is extremely inefficient: if the first and third words are the same, you end up scanning the whole book again even though you already counted that word.
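A minimal sketch of that pass-per-word approach, keeping only the current candidate and the best result seen so far; again the file name and tokenization are assumptions.

    def words(path):
        with open(path) as f:
            for line in f:
                for w in line.lower().split():
                    yield w

    def most_frequent(path):
        best_word, best_count = None, 0
        for candidate in words(path):       # outer pass: each word position in the book
            # inner pass: re-read the whole book to count this candidate
            count = sum(1 for w in words(path) if w == candidate)
            if count > best_count:
                best_word, best_count = candidate, count
        return best_word, best_count

    print(most_frequent("book.txt"))

Memory use is constant, but the running time is quadratic in the length of the book, which matches the stated trade-off: minimum memory, performance not a concern.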

aleemb answered Oct 03 '22 07:10