Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)

I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.

My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.

What approach would be best?

like image 753
erjiang Avatar asked Dec 29 '22 12:12

erjiang


2 Answers

I think a trie with the count as the leaves could be faster.

Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.

A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.

For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.

If you can perform this inline searching faster on a word faster than the hashing algorithm can hash, you should be able to be faster.

However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.

like image 89
Ben S Avatar answered Jan 13 '23 23:01

Ben S


I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.

like image 28
Robert H. Avatar answered Jan 13 '23 23:01

Robert H.