 

Design an algorithm to find the most frequently used word in a book

An interview question:

Find the most frequently used word in a book.

My idea:

Traverse the book word by word, using a hash table to count the occurrences of each word.
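
A minimal sketch of this idea in Python, assuming the book has already been tokenized into a list of words (the case-folding is an illustrative assumption about how words are normalized):

    from collections import Counter

    def most_frequent_word(words):
        """One pass over the book, counting each word in a hash table."""
        counts = Counter()
        for word in words:
            counts[word.lower()] += 1      # normalize case (assumption)
        # Final scan over the table is O(m), m = number of unique words.
        return max(counts, key=counts.get)

    # Example: most_frequent_word(["the", "cat", "sat", "on", "the", "mat"]) -> "the"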

If the book's size is known and some word is found to account for more than 50% of the words, no other word can overtake it, so the rest of the traversal can skip new words and only count ones already in the table. But what if the book's size is unknown?

The hash-table approach is O(n) time and O(n) space.
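
For the unknown-size question above, if we only care about the special situation where one word makes up more than half of the book, the Boyer-Moore majority vote algorithm (a swapped-in technique, not named in the question) finds the only possible such word in a single streaming pass with O(1) extra space; a second pass is still needed to verify the candidate really is a majority:

    def majority_candidate(words):
        """Boyer-Moore majority vote: streaming, O(1) extra space.

        Returns the only word that *could* occur in more than half of
        the stream; verify its count with a second pass.
        """
        candidate, count = None, 0
        for word in words:
            if count == 0:
                candidate, count = word, 1
            elif word == candidate:
                count += 1
            else:
                count -= 1
        return candidate

If the verification pass fails, no word exceeds 50% and the general problem still requires the full count.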

Any better ideas?

Thanks

asked Jan 06 '12 by user1002288

2 Answers

To determine complexity I think you need to consider two variables: n = total number of words and m = number of unique words. You iterate over each of the n words and, for each one, search and possibly insert into a structure that eventually contains m elements. With an ordered structure such as a balanced tree that comes out to roughly O(n log(m)) time; with a hash table it is closer to O(n) expected time. Either way, storage is O(m).
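
To make the two variables concrete, here is a sketch using the third-party sortedcontainers package (an assumption; it is not in the standard library), whose SortedDict gives roughly O(log m) lookups and insertions, so the whole count is O(n log m) time with O(m) storage:

    from sortedcontainers import SortedDict   # pip install sortedcontainers

    def most_frequent_word(words):
        counts = SortedDict()                  # eventually holds m entries
        for word in words:                     # n iterations
            counts[word] = counts.get(word, 0) + 1   # O(log m) each
        return max(counts, key=counts.get)     # final O(m) scan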

answered by SmacL


This is actually a classic example of MapReduce.

The example on the Wikipedia page gives you the count of each unique word, but you can easily add a step in the reduce phase that keeps track of the current most common word (with some kind of mutex to deal with concurrency issues).

If you have a distributed cluster of machines or a highly parallelized computer, this will run much faster than using a single hash table.
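
A single-machine sketch of the same map/reduce shape using Python's multiprocessing module (the pre-chunked input is an assumption; a real MapReduce framework would shard the book for you):

    from collections import Counter
    from multiprocessing import Pool

    def count_chunk(words):
        """Map step: count one chunk of the book independently."""
        return Counter(words)

    def most_frequent_word(chunks):
        """Reduce step: merge the per-chunk counts, then take the max."""
        with Pool() as pool:
            partials = pool.map(count_chunk, chunks)   # parallel map phase
        total = Counter()
        for partial in partials:                       # sequential reduce
            total += partial
        return total.most_common(1)[0]

    if __name__ == "__main__":
        chunks = [["the", "cat", "sat"], ["on", "the", "mat"]]
        print(most_frequent_word(chunks))              # ('the', 2)

Because the merge here happens only after the parallel map completes, no mutex is needed; a reducer that updated a shared running maximum during the map phase would need one, as noted above.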

answered by thekoalaz