Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting the number of occurrences of words in a textfile

How could I go about keeping track of the number of times a word appears in a textfile? I would like to do this for every word.

For example, if the input is something like:

"the man said hi to the boy."

Each of "man said hi to boy" would have an occurrence of 1.

"the" would have an occurence of 2.

I was thinking of keeping a dictionary with word/occurrence pairs but I'm not sure how to implement this in C. A link to any similar or related problems with a solution would be great.


EDIT: To avoid rolling out my own hash table I decided to learn how to use glib. Along the way I found an excellent tutorial which walks through a similar problem. http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html

I am awestruck by the number of different approaches, and in particular the simplicity and elegance of the Ruby implementation.

like image 975
vinc456 Avatar asked Dec 25 '08 22:12

vinc456


2 Answers

Yes, a dictionary with word-occurence pairs would work fine, and the usual way to implement such a dictionary would be to use a hash table (or, sometimes, a binary search tree).

You could also use a trie (or its compressed version, "Patricia trie"/Radix trie) whose complexity is asymptotically optimal for this problem, although I suspect that in practice it might be slower than a (good) hash table implementation.

[I really think whether hash tables or tries are better depends on the distribution of words in your input -- e.g. a hash table will need to store each word in its hash bucket (to guard against collisions), while if you have a lot of words with common prefixes, in a trie those common prefixes are shared and need to be stored only once each, but there is still the overhead of all the pointers... if you do happen to try both, I'm curious to know how they compare.]

like image 110
ShreevatsaR Avatar answered Sep 22 '22 13:09

ShreevatsaR


Just for the curious, here is a simple Ruby solution of the word count problem. It should be basically the same algorithm in C, just with a lot more code.

h = Hash.new(0)
File.read("filename.txt").split.each do |w|
  h[w] += 1
end
p h
like image 22
martinus Avatar answered Sep 24 '22 13:09

martinus