
How to count words in a corpus document

Tags:

python

nltk

I want to know the best way to count words in a document. I have my own corpus set up, and I want to know how frequently the words "students", "trust", and "ayre" occur in the file "corp.txt". What could I use?

Would it be one of the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students", "trust", "ayre" occur in full?

Thanks, Ray

Ray Hmar asked Dec 01 '22 01:12


1 Answer

I would suggest looking into collections.Counter. Especially for large amounts of text, it does the trick and is limited only by the available memory. It counted 30 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (the variable Words will in practice be some reference to a file or similar):

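from collections import Counter

my_counter = Counter()
for word in Words:  # Words: any iterable of word tokens
    # Count the whole word; Counter.update(word) would count its characters.
    my_counter[word] += 1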
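
As a more concrete sketch of the same idea applied to the question, the snippet below tokenizes a text with a simple regular expression and then looks up individual words. The helper name count_words, the regex tokenizer, and the sample text are assumptions made for illustration and are not part of the original answer; in practice the text would come from reading "corp.txt".

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase word tokens in text (hypothetical helper)."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    my_counter = Counter()
    for word in tokens:
        my_counter[word] += 1
    return my_counter


# Made-up sample text standing in for the contents of corp.txt.
sample = "Students trust teachers. Teachers trust students, and students learn."
my_counter = count_words(sample)

for target in ("students", "trust", "ayre"):
    # A Counter returns 0 for a missing key instead of raising KeyError.
    print(target, my_counter[target])
```

To run this on a real file, sample could be replaced with the result of open("corp.txt", encoding="utf-8").read(), where the encoding is an assumption. As far as I know, FreqDist in NLTK 3 is a subclass of Counter, so fdist["students"] in the question's code should support the same kind of lookup.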

When finished, the word counts are in my_counter, a dictionary subclass, which can then be written to disk or stored elsewhere (SQLite, for example).
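
One possible way to store the counts in SQLite is sketched below using the standard-library sqlite3 module. The table name word_counts, its column names, and the sample counts are assumptions chosen for illustration; an in-memory database is used only to keep the example self-contained.

```python
import sqlite3
from collections import Counter

# Made-up counts standing in for a finished my_counter.
my_counter = Counter({"students": 3, "trust": 2})

# ":memory:" keeps the example self-contained; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, n INTEGER)")
conn.executemany(
    "INSERT INTO word_counts (word, n) VALUES (?, ?)",
    my_counter.items(),
)
conn.commit()

# Read one count back by word.
row = conn.execute(
    "SELECT n FROM word_counts WHERE word = ?", ("trust",)
).fetchone()
print(row[0])
```

Passing a file name such as "counts.db" (a hypothetical path) to sqlite3.connect instead of ":memory:" would write the table to disk.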

Lars GJ answered Dec 05 '22 01:12