 

How do I count words in an nltk plaintextcorpus faster?

I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I can learn better coding, but also so that I can get this project done faster!

import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for fileid in wordlists.fileids():
        date = fileid[4:12]      # filenames embed a YYYYMMDD date at positions 4-12
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(fileid)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))

    return counts
asked Oct 10 '10 by Mark Bellhorn

People also ask

How do you count words in nltk?

So in Python using the nltk module, we can tokenize strings either into words or sentences. We then simply use the len() function to find the number of words or sentences in the string.
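As a minimal, dependency-free sketch of that idea (a regex stands in for nltk.word_tokenize here so the example runs without NLTK or its tokenizer data):

```python
import re

# Tokenize a string into words, then count them with len().
# re.findall(r"\w+", ...) is a rough stand-in for nltk.word_tokenize.
def count_words(text):
    tokens = re.findall(r"\w+", text)
    return len(tokens)

print(count_words("the quick brown fox jumps over the lazy dog"))  # 9
```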

How do you count words in NLP?

After tokenising a text, the first figure we can calculate is the word frequency. By word frequency we indicate the number of times each token occurs in a text. When talking about word frequency, we distinguish between types and tokens.
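A small sketch of the types/tokens distinction (str.split() is used in place of an NLTK tokenizer so it is self-contained):

```python
# "to be or not to be" has 6 tokens (occurrences) but only
# 4 types (distinct words): to, be, or, not.
text = "to be or not to be"
tokens = text.split()
types = set(tokens)

# Word frequency: how many times each type occurs as a token.
freq = {}
for tok in tokens:
    freq[tok] = freq.get(tok, 0) + 1

print(len(tokens))  # 6 tokens
print(len(types))   # 4 types
print(freq["to"])   # 2
```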

How do you count words in Python?

Python Code:

def word_count(str):
    counts = dict()
    words = str.split()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

print(word_count('the quick brown fox jumps over the lazy dog. '))


1 Answer

If you just want a frequency of word counts, then you don't need to create nltk.Text objects, or even use PlaintextCorpusReader. Instead, just go straight to nltk.FreqDist.

import nltk

files = list_of_files
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        text = f.read().lower()  # lower() belongs on the string, not the file object
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1    # FreqDist.inc() was removed in NLTK 3

Or, if you don't want to do any analysis - just use a dict.

files = list_of_files
fd = {}
for file in files:
    with open(file) as f:
        text = f.read().lower()
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word] + 1
                except KeyError:
                    fd[word] = 1

These could be made much more efficient with generator expressions, but I've used for loops for readability.
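As a hedged sketch of that generator-expression version, with collections.Counter and str.split() standing in for nltk.FreqDist and the NLTK tokenizers so it runs with the standard library alone:

```python
from collections import Counter

# Feed a generator of lowercased tokens straight into Counter:
# no explicit nested loops and no KeyError handling needed.
def count_tokens(lines):
    return Counter(word for line in lines for word in line.lower().split())

counts = count_tokens(["The quick brown fox", "the lazy dog"])
print(counts["the"])  # 2
```

Counter (and nltk.FreqDist, which behaves similarly) returns 0 for missing keys, which is what makes the try/except in the dict version unnecessary.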

answered Sep 22 '22 by Tim McNamara