Algorithm: A Better Way To Calculate Frequencies of a list of words

Tags:

c++

performance

algorithm

data-structures

This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with one word on each line, find the n most frequent words.

The first, and unfortunately only, thing that comes to mind is to use a std::map. I know fellow C++'ers will say that unordered_map would be much more reasonable.

I would like to know whether anything could be improved on the algorithm side, or whether this is basically a 'whoever picks the best data structure wins' type of question. I've searched the internet and read that a hash table plus a priority queue might give an algorithm with O(n) running time; however, I assume it will be too complex to implement.

Any ideas?

asked Apr 17 '12 23:04

Ali

2 Answers

The best data structure to use for this task is a Trie:

http://en.wikipedia.org/wiki/Trie

It will outperform a hash table for counting strings.
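To make the idea concrete, here is a minimal sketch of counting words with a trie. It assumes the words consist only of lowercase ASCII letters 'a'..'z'; the names TrieNode and WordTrie are hypothetical and chosen just for this illustration. Whether a trie really beats a hash table depends on the data and the implementation, so measure on your own input before committing to either.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type: one child slot per lowercase letter (assumption:
// input words contain only 'a'..'z').
struct TrieNode {
    std::size_t count = 0;  // how many times a word ending at this node was added
    std::array<std::unique_ptr<TrieNode>, 26> next{};
};

class WordTrie {
public:
    // Records one occurrence of `word` and returns its updated count.
    // Returns 0 and changes nothing if `word` has a character outside 'a'..'z'.
    std::size_t add(const std::string& word) {
        for (char c : word) {
            if (c < 'a' || c > 'z') return 0;
        }
        TrieNode* node = &root_;
        for (char c : word) {
            auto& child = node->next[static_cast<std::size_t>(c - 'a')];
            if (!child) child = std::make_unique<TrieNode>();  // create path lazily
            node = child.get();
        }
        return ++node->count;
    }

    // Returns how many times `word` was added (0 if never seen).
    std::size_t count(const std::string& word) const {
        const TrieNode* node = &root_;
        for (char c : word) {
            if (c < 'a' || c > 'z') return 0;
            const auto& child = node->next[static_cast<std::size_t>(c - 'a')];
            if (!child) return 0;
            node = child.get();
        }
        return node->count;
    }

private:
    TrieNode root_;
};
```

Each add or count call walks one node per character, so the cost is proportional to the word length and does not depend on how many distinct words are stored. The price is memory: every node in this sketch carries 26 pointers.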

answered Sep 22 '22 00:09

Andrew Tomazos

There are many different approaches to this question. The right one ultimately depends on the scenario and on other factors such as the size of the file: if the file has a billion lines, a HashMap would not be an efficient way to do it. Here are some things you can do depending on your problem:

1. If you know that the number of unique words is very limited, you can use a TreeMap or, in your case, std::map.
2. If the number of words is very large, you can build a trie and keep the counts of the words in another data structure. This could be a heap (min or max, depending on what you want to do) of size n, so you don't need to store all the words, just the necessary ones.

answered Sep 22 '22 00:09

noMAD
