I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I can learn better coding, but also so that I can get this project done faster!
import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for fileid in wordlists.fileids():
        # The date is assumed to be embedded in the filename as YYYYMMDD,
        # starting at character 4
        date = fileid[4:12]
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(fileid)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))
    return counts
In Python, the nltk module lets us tokenize a string into either words or sentences; calling len() on the resulting list then gives the number of words or sentences in the string.
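As a minimal sketch of that idea (the sample sentence is just an illustration, and the punkt tokenizer models are assumed to be downloaded):

import nltk

text = "The quick brown fox jumps over the lazy dog. It never looked back."

# Tokenize into words and into sentences, then count with len()
words = nltk.word_tokenize(text)
sentences = nltk.sent_tokenize(text)

print(len(words))      # number of word tokens (punctuation counts as tokens)
print(len(sentences))  # number of sentences, e.g. 2 here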
After tokenizing a text, the first figure we can calculate is the word frequency, i.e. the number of times each token occurs in the text. When talking about word frequency, we distinguish between types (distinct word forms) and tokens (individual occurrences); the distinction is illustrated after the code example below.
Python Code:

def word_count(text):
    counts = dict()
    words = text.split()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

print(word_count('the quick brown fox jumps over the lazy dog.'))
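To make the types/tokens distinction concrete, here is a small follow-up using the dict returned by word_count (variable names are just for illustration):

counts = word_count('the quick brown fox jumps over the lazy dog.')

num_types = len(counts)            # distinct words (types): 'the' counted once
num_tokens = sum(counts.values())  # total occurrences (tokens): 'the' counted twice
print(num_types, num_tokens)       # 8 types, 9 tokens for this sentence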
If you just want a frequency of word counts, then you don't need to create nltk.Text objects, or even use nltk.corpus.PlaintextCorpusReader. Instead, just go straight to nltk.FreqDist.
import nltk

files = list_of_files
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        # Read the file once, lowercase it, then tokenize sentence by sentence
        text = f.read().lower()
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1  # FreqDist behaves like collections.Counter in NLTK 3+
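Once the FreqDist is built, looking up a count is a dictionary-style access (searchword here stands in for whatever term you are counting):

print(fd[searchword])      # how often the search term appears across all files
print(fd.most_common(10))  # the ten most frequent tokens as (word, count) pairs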
Or, if you don't want to do any analysis, just use a dict.
files = list_of_files
fd = {}
for file in files:
    with open(file) as f:
        text = f.read().lower()  # read the whole file, then lowercase it
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word] + 1
                except KeyError:
                    fd[word] = 1
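The try/except bookkeeping can also be avoided with collections.defaultdict, which supplies a zero the first time a key is seen; this is an alternative sketch, not part of the original answer:

from collections import defaultdict

fd = defaultdict(int)
for file in files:
    with open(file) as f:
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1  # missing keys default to 0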
These could be made much more efficient with generator expressions, but I've used for loops for readability.
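For reference, here is one way a generator-expression version might look, using collections.Counter; this is a sketch of the idea rather than the original answer's code, and read_lower is a hypothetical helper name:

from collections import Counter
import nltk

def read_lower(path):
    # Hypothetical helper: read one file and lowercase its contents
    with open(path) as f:
        return f.read().lower()

# A single generator expression feeds Counter: every token of every sentence
# of every file, counted in one pass
fd = Counter(
    word
    for path in list_of_files
    for sent in nltk.sent_tokenize(read_lower(path))
    for word in nltk.word_tokenize(sent)
)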