This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with one word per line, find the n most frequent words.
The first, and unfortunately only, thing that pops up in my mind is to use a std::map. I know fellow C++ers will say that an unordered_map would be much more reasonable.
I would like to know if anything could be added on the algorithm side, or if this is basically a "whoever picks the best data structure wins" type of question. I've searched the internet and read that a hash table combined with a priority queue might provide an algorithm with O(n) running time; however, I assume it would be too complex to implement.
Any ideas?
In Python, NLTK's FreqDist() counts the words, and applying most_common() gives the frequency of each word.
The frequency sort algorithm outputs the elements of an array in descending order of their frequencies. If two elements have the same frequency, the element that occurs first in the input is printed first.
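A minimal C++ sketch of that idea (the sample input is made up for illustration): count each word with an unordered_map while recording its first position in the input, then sort by descending count with ties broken by first appearance.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> words = {"b", "a", "c", "a", "b", "a"};

    // Count occurrences and remember each word's first position in the input.
    std::unordered_map<std::string, std::pair<int, size_t>> info; // word -> {count, first index}
    for (size_t i = 0; i < words.size(); ++i) {
        auto it = info.find(words[i]);
        if (it == info.end())
            info[words[i]] = {1, i};
        else
            ++it->second.first;
    }

    // Sort by descending count; ties broken by first appearance in the input.
    std::vector<std::pair<std::string, std::pair<int, size_t>>> sorted(info.begin(), info.end());
    std::sort(sorted.begin(), sorted.end(), [](const auto& a, const auto& b) {
        if (a.second.first != b.second.first) return a.second.first > b.second.first;
        return a.second.second < b.second.second;
    });

    for (const auto& [word, meta] : sorted)
        std::cout << word << ": " << meta.first << '\n'; // prints a:3, b:2, c:1
}
```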
The best data structure to use for this task is a Trie:
http://en.wikipedia.org/wiki/Trie
It can outperform a hash table for counting strings, since common prefixes are stored only once and no hash needs to be computed per lookup.
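Here is a minimal counting-trie sketch in C++, assuming for simplicity that words consist only of lowercase ASCII letters (that restriction, and the fixed child array per node, are illustrative choices rather than a required design):

```cpp
#include <array>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// A counting trie node: one child slot per lowercase letter.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children{};
    int count = 0; // number of times the word ending at this node was seen
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    ++node->count;
}

int main() {
    TrieNode root;
    std::vector<std::string> words = {"apple", "app", "apple"};
    for (const auto& w : words) insert(root, w);

    // Look up a word's count by walking the trie character by character.
    TrieNode* node = &root;
    for (char c : std::string("apple")) node = node->children[c - 'a'].get();
    std::cout << "apple: " << node->count << '\n'; // prints 2
}
```

Whether this actually beats a hash table in practice depends on the word distribution and memory layout; the per-node child array trades memory for lookup speed.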
There are many different approaches to this question. It ultimately depends on the scenario and other factors, such as the size of the file (if the file has a billion lines, a HashMap would not be an efficient way to do it). Here are some things you can do, depending on your problem:

1. Use a TreeMap, or in your case a std::map, which keeps the words sorted while you count them.
2. Use a trie to count the words, and keep track of the n most frequent ones in another data structure. This could be a heap (min or max, depending on what you want to do) of size n, so you don't need to store all the words, just the necessary ones. A sketch of this approach follows the list.
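Here is a minimal C++ sketch of that last idea. It uses an unordered_map for the counting stage rather than a trie, purely for brevity; the same min-heap-of-size-n trimming works with either. The sample input and n = 2 are made up for illustration:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> words = {"the", "cat", "the", "dog", "the", "cat"};
    const size_t n = 2; // how many top words to keep

    // First pass: count every word with a hash map.
    std::unordered_map<std::string, int> counts;
    for (const auto& w : words) ++counts[w];

    // Keep only the n most frequent entries with a min-heap of size n:
    // the smallest count sits on top and is evicted when a bigger one arrives.
    using Entry = std::pair<int, std::string>; // {count, word}
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& [word, count] : counts) {
        heap.emplace(count, word);
        if (heap.size() > n) heap.pop(); // drop the least frequent so far
    }

    // Pop in ascending order of frequency.
    while (!heap.empty()) {
        std::cout << heap.top().second << ": " << heap.top().first << '\n';
        heap.pop();
    }
    // prints cat: 2, then the: 3
}
```

Because the heap never holds more than n entries, the trimming stage costs O(m log n) for m distinct words, which is where the near-linear overall running time mentioned in the question comes from.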