Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tracking/Counting Word Frequency

I'd like to get some community consensus on a good design to be able to store and query word frequency counts. I'm building an application in which I have to parse text inputs and store how many times a word has appeared (over time). So given the following inputs:

  • "To Kill a Mocking Bird"
  • "Mocking a piano player"

Would store the following values:

Word    Count
-------------
To      1
Kill    1
A       2
Mocking 2
Bird    1
Piano   1
Player  1

And later be able to quickly query for the count value of a given arbitrary word.

My current plan is to simply store the words and counts in a database, and rely on caching word count values ... But I suspect that I won't get enough cache hits to make this a viable solution long term.

Can anyone suggest algorithms, or data structures, or any other idea that might make this a well-performing solution?

like image 776
Joel Martinez Avatar asked May 17 '10 20:05

Joel Martinez


People also ask

How does word frequency affect word recognition?

When word recognition is analyzed, frequency of occurrence is one of the strongest predictors of processing efficiency. High-frequency words are known to more people and are processed faster than low-frequency words (the word frequency effect; Monsell, Doyle, & Haggard, 1989).

How do you count frequency in NLP?

One of the key steps in NLP or Natural Language Process is the ability to count the frequency of the terms used in a text document or table. To achieve this we must tokenize the words so that they represent individual objects that can be counted. There are a great set of libraries that you can use to tokenize words.

How do you find the accuracy of a word count?

Click on Word Count.Select Word Count from the Tools menu dropdown. A box displaying the number of words, characters, lines, pages and paragraphs will appear on the screen. The word count for a selected portion of text will usually be displayed in the bottom bar of your document.


1 Answers

Word counting is the canonical example of a MapReduce program (pseudo code from Wikipedia):

void map(String name, String document):
  for each word w in document:
     EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

I am not saying that this is the way to do it, but it is definitely an option if you need something that scales well when the number of distinct words outsizes the memory available on a single machine. As long as you are able to stay below the memory limit, a simple loop updating a hash table should do the trick.

like image 185
Jørn Schou-Rode Avatar answered Nov 12 '22 19:11

Jørn Schou-Rode