Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm: A Better Way To Calculate Frequencies of a list of words

This question is actually quite simple yet I would like to hear some ideas before jumping into coding. Given a file with a word in each line, calculating most n frequent numbers.

The first and unfortunately only thing that pops up in my mind use to use a std::map. I know fellow C++'ers will say that unordered_map would be so much reasonable.

I would like to know if anything could be added to the algorithm side or this is just basically 'whoever picks the best data structure wins' type of question. I've searched it over the internet and read that hash table and a priority queue might provide an algorithm with O(n) running time however I assume it will be to complex to implement

Any ideas?

like image 527
Ali Avatar asked Apr 17 '12 23:04

Ali


People also ask

What is used for finding the frequency of words in the given text?

Using FreqDist() Applying the most_common() gives us the frequency of each word.

What is frequency in algorithm?

The frequency sort algorithm is used to output elements of an array in descending order of their frequencies. If two elements have the same frequencies, then the element that occurs first in the input is printed first.


2 Answers

The best data structure to use for this task is a Trie:

http://en.wikipedia.org/wiki/Trie

It will outperform a hash table for counting strings.

like image 77
Andrew Tomazos Avatar answered Sep 22 '22 00:09

Andrew Tomazos


There are many different approaches to this question. It would finally depend on the scenario and others factors such as the size of the file (If the file has a billion lines) then a HashMapwould not be an efficient way to do it. Here are some things which you can do depending on your problem:

  1. If you know that the number of unique words are very limited, you can use a TreeMap or in your case std::map.
  2. If the number of words are very large then you can build a trie and keep count of various words in another data structure. This could be a heap (min/max depends on what you want to do) of size n. So you don't need to store all the words, just the necessary ones.
like image 31
noMAD Avatar answered Sep 22 '22 00:09

noMAD