
Importing and Using NLTK corpus

Tags:

python

nltk

Please, please, please help. I have a folder filled with text files that I want to use NLTK to analyze. How do I import that as a corpus and then run NLTK commands on it? I've put together the code below but it's giving me this error:

    raise error, v # invalid expression
sre_constants.error: nothing to repeat

Code:

import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist

corpus_root = '/Users/jt/Documents/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, '*.txt')

print "Finished importing corpus" 

words = FreqDist()

for sentence in speeches.sents():
    for word in sentence:
        words.inc(word.lower())

print words["he"]
print words.freq("he")
Jolijt Tamanaha asked Jan 20 '26 11:01


1 Answer

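I understand this problem has to do with a known bug (maybe it's a feature?), which is partially explained in this answer. In short, PlaintextCorpusReader treats the file pattern as a regular expression, not a shell glob. In a regex, * repeats whatever precedes it, so a pattern that starts with * has nothing to repeat, and the re module raises exactly the error you're seeing.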

The source of the error is your speeches = line. You should change it to the following:

speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')

Then everything will load and compile just fine.
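
If you want to see the difference between the two patterns without touching NLTK at all, here is a minimal sketch using only the standard re module. It is written for Python 3, and the helper name compiles and the file names in names are made up for illustration. It only assumes that the pattern is handed to re as a regular expression, which is what the traceback suggests.

```python
import re


def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # A leading '*' ends up here: there is nothing for it to repeat.
        return False


glob_style = '*.txt'       # shell glob syntax, not a valid regex
regex_style = r'.*\.txt'   # regex: any characters, then a literal '.txt'

# Hypothetical file names, for illustration only.
names = ['speech1.txt', 'notes.md', 'speech2.txt']
matched = [n for n in names if re.fullmatch(regex_style, n)]

print(compiles(glob_style))
print(compiles(regex_style))
print(matched)
```

The backslash before the dot matters too: an unescaped . matches any character, so escaping it keeps the pattern tied to a literal .txt ending.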

davidlowryduda answered Jan 22 '26 04:01