 

How do I count words in an nltk plaintextcorpus faster?

I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I can learn better coding, but also so that I can get this project done faster!

import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for fileid in wordlists.fileids():
        date = fileid[4:12]      # filenames embed a YYYYMMDD date at positions 4-12
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(fileid)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))

    return counts
asked Oct 10 '10 by Mark Bellhorn

People also ask

How do you count words in nltk?

So in Python using the nltk module, we can tokenize strings either into words or sentences. We then simply use the len() function to find the number of words or sentences in the string.
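As a minimal, dependency-free sketch of that idea (a regex stands in for nltk.word_tokenize here so the example runs without NLTK or its tokenizer data):

```python
import re

# Tokenize a string into words, then count them with len().
# re.findall(r"\w+", ...) is a rough stand-in for nltk.word_tokenize.
def count_words(text):
    tokens = re.findall(r"\w+", text)
    return len(tokens)

print(count_words("the quick brown fox jumps over the lazy dog"))  # 9
```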

How do you count words in NLP?

After tokenising a text, the first figure we can calculate is the word frequency. By word frequency we indicate the number of times each token occurs in a text. When talking about word frequency, we distinguish between types and tokens.
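A small sketch of the types/tokens distinction (str.split() is used in place of an NLTK tokenizer so it is self-contained):

```python
# "to be or not to be" has 6 tokens (occurrences) but only
# 4 types (distinct words): to, be, or, not.
text = "to be or not to be"
tokens = text.split()
types = set(tokens)

# Word frequency: how many times each type occurs as a token.
freq = {}
for tok in tokens:
    freq[tok] = freq.get(tok, 0) + 1

print(len(tokens))  # 6 tokens
print(len(types))   # 4 types
print(freq["to"])   # 2
```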

How do you count words in Python?

Python Code:

def word_count(str):
    counts = dict()
    words = str.split()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

print(word_count('the quick brown fox jumps over the lazy dog. '))


1 Answer

If you just want a frequency of word counts, then you don't need to create nltk.Text objects, or even use PlaintextCorpusReader. Instead, just go straight to nltk.FreqDist.

import nltk

files = list_of_files
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        text = f.read().lower()  # lower() belongs on the string, not the file object
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1    # FreqDist.inc() was removed in NLTK 3

Or, if you don't want to do any analysis - just use a dict.

files = list_of_files
fd = {}
for file in files:
    with open(file) as f:
        text = f.read().lower()
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word] + 1
                except KeyError:
                    fd[word] = 1

These could be made much more efficient with generator expressions, but I've used for loops for readability.
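As a hedged sketch of that generator-expression version, with collections.Counter and str.split() standing in for nltk.FreqDist and the NLTK tokenizers so it runs with the standard library alone:

```python
from collections import Counter

# Feed a generator of lowercased tokens straight into Counter:
# no explicit nested loops and no KeyError handling needed.
def count_tokens(lines):
    return Counter(word for line in lines for word in line.lower().split())

counts = count_tokens(["The quick brown fox", "the lazy dog"])
print(counts["the"])  # 2
```

Counter (and nltk.FreqDist, which behaves similarly) returns 0 for missing keys, which is what makes the try/except in the dict version unnecessary.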

answered Sep 22 '22 by Tim McNamara