Classification using movie review corpus in NLTK/Python

Tags:

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part -- category creation based upon directory names. Some other questions on here have used filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.

from nltk.corpus import movie_reviews

reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']

documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

all_words=nltk.FreqDist(
    w.lower() 
    for w in movie_reviews.words() 
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in  string.punctuation)
word_features = all_words.keys()[:100]

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print document_features(movie_reviews.words('pos/11.txt'))

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

This returns:

File "test.py", line 38, in <module>
    for w in movie_reviews.words()

File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
    self, self._resolve(fileids, categories))

File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
    in self.abspaths(fileids, True, True)])

File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')

ValueError: concat() expects at least one object!

---------UPDATE------------- Thanks alvas for your detailed answer! I have two questions, however.

Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the review_pos.txt method, only grabbing the pos from the folder name rather than the file name.
I ran your code and am experiencing a syntax error on

train_set =[({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]] test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

with the carrot under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to try to toubleshoot it.

----UPDATE 2---- Error is

File "review.py", line 17
  for i in word_features}, tag)
    ^
SyntaxError: invalid syntax`

385

asked Jan 14 '14 06:01

user3128184

1 Answers

Yes, the tutorial on chapter 6 is aim for a basic knowledge for students and from there, the students should build on it by exploring what's available in NLTK and what's not. So let's go through the problems one at a time.

Firstly, the way to get 'pos' / 'neg' documents through the directory is most probably the right thing to do, since the corpus was organized that way.

from nltk.corpus import movie_reviews as mr
from collections import defaultdict

documents = defaultdict(list)

for i in mr.fileids():
    documents[i.split('/')[0]].append(i)

print documents['pos'][:10] # first ten pos reviews.
print
print documents['neg'][:10] # first ten neg reviews.

[out]:

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt']

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']

Alternatively, I like a list of tuples where the first is element is the list of words in the .txt file and second is the category. And while doing so also remove the stopwords and punctuations:

from nltk.corpus import movie_reviews as mr
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

Next is the error at FreqDist(for w in movie_reviews.words() ...). There is nothing wrong with your code, just that you should try to use namespace (see http://en.wikipedia.org/wiki/Namespace#Use_in_common_languages). The following code:

from nltk.corpus import movie_reviews as mr
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
stop = stopwords.words('english')

all_words = FreqDist(w.lower() for w in mr.words() if w.lower() not in stop and w.lower() not in string.punctuation)

print all_words

[outputs]:

<FreqDist: 'film': 9517, 'one': 5852, 'movie': 5771, 'like': 3690, 'even': 2565, 'good': 2411, 'time': 2411, 'story': 2169, 'would': 2109, 'much': 2049, ...>

Since the above code prints the FreqDist correctly, the error seems like you do not have the files in nltk_data/ directory.

The fact that you have fic/11.txt suggests that you're using some older version of the NLTK or NLTK corpora. Normally the fileids in movie_reviews, starts with either pos/neg then a slash then the filename and finally .txt , e.g. pos/cv001_18431.txt.

So I think, maybe you should redownload the files with:

$ python
>>> import nltk
>>> nltk.download()

Then make sure that the movie review corpus is properly downloaded under the corpora tab:

MR dl

Back to the code, looping through all the words in the movie review corpus seems redundant if you already have all the words filtered in your documents, so i would rather do this to extract all featureset:

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]

Next, splitting the train/test by features is okay but i think it's better to use documents, so instead of this:

featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

I would recommend this instead:

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

Then feed the data into the classifier and voila! So here's the code without the comments and walkthrough:

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

[out]:

0.655
Most Informative Features
                     bad = True              neg : pos    =      2.0 : 1.0
                  script = True              neg : pos    =      1.5 : 1.0
                   world = True              pos : neg    =      1.5 : 1.0
                 nothing = True              neg : pos    =      1.5 : 1.0
                     bad = False             pos : neg    =      1.5 : 1.0

116

answered Oct 08 '22 04:10

alvas

Related questions
                            
                                Efficiently generate all composite numbers less than N (with their factorizations)
                            
                                how to send my game made with pygame to others?
                            
                                Python -- import a module from a directory that's not a package
                            
                                Can yield produce multiple consecutive generators?
                            
                                How to kill Django runserver sub processes from a bash script?
                            
                                Python large iterations number fail
                            
                                Scapy.all import * does not work
                            
                                How to parse .ttl files with RDFLib?
                            
                                Matplotlib stops animating after first frame
                            
                                Python: preg_replace function analog
                            
                                Python argparse: command-line argument that can be either named or positional
                            
                                Tkinter window focus on Mac OS X
                            
                                Speedup virtualenv creation with numpy and pandas
                            
                                Django runserver color output
                            
                                how to run task at scheduled time with RabbitMQ
                            
                                Python except None
                            
                                Is boto library thread-safe?
                            
                                Flask/SQLAlchemy error: TypeError: Incompatible collection type: [model] is not list-like
                            
                                Union find implementation using Python
                            
                                Is python's shutil.copyfile() atomic?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Classification using movie review corpus in NLTK/Python

Tags:

python

nlp

nltk

sentiment-analysis

corpus

user3128184

People also ask

1 Answers

alvas

Recent Activity

Donate For Us