 

Classification using movie review corpus in NLTK/Python

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part -- category creation based upon directory names. Some other questions on here have used filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.

import string
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']

documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

all_words=nltk.FreqDist(
    w.lower() 
    for w in movie_reviews.words() 
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in  string.punctuation)
word_features = all_words.keys()[:100]

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print document_features(movie_reviews.words('pos/11.txt'))

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

This returns:

File "test.py", line 38, in <module>
    for w in movie_reviews.words()

File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
    self, self._resolve(fileids, categories))

File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
    in self.abspaths(fileids, True, True)])

File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')

ValueError: concat() expects at least one object!

---------UPDATE------------- Thanks alvas for your detailed answer! I have two questions, however.

  1. Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the review_pos.txt method, only grabbing the pos from the folder name rather than the file name.
  2. I ran your code and am experiencing a syntax error on

    train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
    test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

with the caret under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to troubleshoot it.

----UPDATE 2---- Error is

File "review.py", line 17
  for i in word_features}, tag)
    ^
SyntaxError: invalid syntax
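Note: the paths in the first traceback show Python 2.6, and dict comprehensions ({key: value for ...}) were only added in Python 2.7, which would produce exactly this SyntaxError. If upgrading isn't an option, a 2.6-compatible rewrite of those lines might look like this (a sketch, using the same documents and word_features as in the answer below):

    # dict((key, value) for ...) is the pre-2.7 spelling of a dict comprehension
    train_set = [(dict((i, i in tokens) for i in word_features), tag) for tokens,tag in documents[:numtrain]]
    test_set = [(dict((i, i in tokens) for i in word_features), tag) for tokens,tag in documents[numtrain:]]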
asked Jan 14 '14 by user3128184



1 Answer

Yes, the tutorial in chapter 6 is aimed at giving students a basic foundation, and from there they should build on it by exploring what's available in NLTK and what's not. So let's go through the problems one at a time.

Firstly, getting the 'pos' / 'neg' documents through the directory names is most probably the right thing to do, since the corpus is organized that way.

from nltk.corpus import movie_reviews as mr
from collections import defaultdict

documents = defaultdict(list)

for i in mr.fileids():
    documents[i.split('/')[0]].append(i)

print documents['pos'][:10] # first ten pos reviews.
print
print documents['neg'][:10] # first ten neg reviews.

[out]:

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt']

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
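As for the first follow-up question (grabbing the category from the folder name, as attempted in the question): CategorizedPlaintextCorpusReader can do that, as long as cat_pattern captures the directory component of each fileid. A minimal sketch, assuming a hypothetical ./my_reviews root containing pos/ and neg/ subfolders of .txt files:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# the capture group in cat_pattern grabs the folder name,
# so the directory a file sits in becomes its category
reader = CategorizedPlaintextCorpusReader('./my_reviews', r'(?!\.).*\.txt', cat_pattern=r'(\w+)/.*')
print reader.categories() # e.g. ['neg', 'pos']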

Alternatively, I like a list of tuples where the first element is the list of words in the .txt file and the second is the category. And while doing so, also remove the stopwords and punctuation:

from nltk.corpus import movie_reviews as mr
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
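A quick sanity check on the result; the corpus ships 1000 pos and 1000 neg reviews, so there should be 2000 tuples:

print len(documents) # 2000
print documents[0][1] # 'neg' -- the tag of the first review
print documents[0][0][:10] # its first ten filtered tokens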

Next is the error at FreqDist(for w in movie_reviews.words() ...). There is nothing wrong with your code, but you should try to use namespaces (see http://en.wikipedia.org/wiki/Namespace#Use_in_common_languages). The following code:

from nltk.corpus import movie_reviews as mr
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
stop = stopwords.words('english')

all_words = FreqDist(w.lower() for w in mr.words() if w.lower() not in stop and w.lower() not in string.punctuation)

print all_words

[outputs]:

<FreqDist: 'film': 9517, 'one': 5852, 'movie': 5771, 'like': 3690, 'even': 2565, 'good': 2411, 'time': 2411, 'story': 2169, 'would': 2109, 'much': 2049, ...>

Since the above code prints the FreqDist correctly, the error suggests that you do not have the files in your nltk_data/ directory.

The fact that you have pos/11.txt suggests that you're using an older version of NLTK or the NLTK corpora. Normally the fileids in movie_reviews start with either pos/ or neg/, then the filename, and finally .txt, e.g. pos/cv001_18431.txt.

So I think you should re-download the files with:

$ python
>>> import nltk
>>> nltk.download()

Then make sure that the movie review corpus is properly downloaded under the corpora tab:

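If you prefer to skip the GUI, downloading just the two corpora used here also works:

>>> import nltk
>>> nltk.download('movie_reviews')
>>> nltk.download('stopwords')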

Back to the code: looping through all the words in the movie review corpus again seems redundant if you already have all the words filtered in your documents, so I would rather do this to extract the feature set:

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
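A portability note: FreqDist.keys() coming back sorted by frequency is old-NLTK, Python 2 behaviour. On NLTK 3 the same top-100 selection is usually written with most_common() (a sketch replacing the .keys()[:100] line above):

freq = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, count in freq.most_common(100)] # top 100 words by count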

Next, splitting the train/test sets by features is okay, but I think it's better to split by documents, so instead of this:

featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

I would recommend this instead:

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
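One caveat with slicing documents directly: mr.fileids() comes back sorted, all neg before all pos, so the last 10% taken as the test set would be entirely pos reviews. Shuffling first avoids that (a small addition; the seed is only there to make the split reproducible):

import random
random.seed(0) # any fixed seed; keeps the split reproducible across runs
random.shuffle(documents) # do this before slicing into train_set/test_set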

Then feed the data into the classifier, and voilà! Here's the full code without the comments and walkthrough:

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

[out]:

0.655
Most Informative Features
                     bad = True              neg : pos    =      2.0 : 1.0
                  script = True              neg : pos    =      1.5 : 1.0
                   world = True              pos : neg    =      1.5 : 1.0
                 nothing = True              neg : pos    =      1.5 : 1.0
                     bad = False             pos : neg    =      1.5 : 1.0
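Once trained, labelling a fresh review uses the same boolean feature dict. A usage sketch, where new_review is a hypothetical token list:

new_review = ['a', 'bad', 'script', 'with', 'nothing', 'going', 'for', 'it'] # hypothetical tokens
features = {i:(i in new_review) for i in word_features}
print classifier.classify(features) # e.g. 'neg'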
answered Oct 08 '22 by alvas