I've been having a problem understanding chi-squared feature selection. I have two classes, positive and negative, each containing different terms and term counts. I need to perform chi-squared feature selection to extract the most representative terms for each class. The problem is that I end up getting the EXACT same terms for both the positive and the negative class. Here is my Python code for selecting features:
#!/usr/bin/python

# import the necessary libraries
import math

class ChiFeatureSelector:
    def __init__(self, extCorpus, lookupCorpus):
        # store the extraction corpus and lookup corpus
        self.extCorpus = extCorpus
        self.lookupCorpus = lookupCorpus

    def select(self, outPath):
        # dictionary of chi-squared scores
        scores = {}

        # loop over the words in the extraction corpus
        for w in self.extCorpus.getTerms():
            # build the chi-squared table
            n11 = float(self.extCorpus.getTermCount(w))
            n10 = float(self.lookupCorpus.getTermCount(w))
            n01 = float(self.extCorpus.getTotalDocs() - n11)
            n00 = float(self.lookupCorpus.getTotalDocs() - n10)

            # perform the chi-squared calculation and store
            # the score in the dictionary
            a = n11 + n10 + n01 + n00
            b = ((n11 * n00) - (n10 * n01)) ** 2
            c = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
            chi = (a * b) / c
            scores[w] = chi

        # sort the scores in descending order
        scores = sorted([(v, k) for (k, v) in scores.items()], reverse = True)
        i = 0

        for (v, k) in scores:
            print str(k) + " : " + str(v)
            i += 1

            if i == 10:
                break
And this is how I use the class (some code omitted for brevity's sake, and yes, I have checked to ensure that the two corpora do not contain the exact same data):
# perform positive ngram feature selection
print "positive:\n"
f = ChiFeatureSelector(posCorpus, negCorpus)
f.select(posOutputPath)
print "\nnegative:\n"
# perform negative ngram feature selection
f = ChiFeatureSelector(negCorpus, posCorpus)
f.select(negOutputPath)
I feel like the error comes from how I calculate the term/document table, but I'm not sure. Perhaps I am not understanding something. Can someone point me in the right direction?
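In the two-class case, the chi-squared ranking of features is the same if the two data sets are exchanged. Swapping the corpora swaps n11 with n10 and n01 with n00, which leaves both the squared term ((n11 * n00) - (n10 * n01)) ** 2 and the product of the row and column totals unchanged, so any term that appears in both corpora gets exactly the same score either way. The top-ranked terms are the features which differ the most between the two classes, regardless of which class they favour.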
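If separate lists per class are wanted, one possible approach is to score each term once and then use the sign of n11 * n00 - n10 * n01 to decide which class the term leans towards. The following is a minimal sketch rather than a drop-in replacement for the class in the question: the function names and the toy counts are hypothetical, and it assumes the counts are document frequencies (the number of documents in each class that contain the term), stored in plain dictionaries.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared score for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # Guard against an empty row or column, which would divide by zero.
    return float(numerator) / denominator if denominator else 0.0


def split_by_class(pos_counts, neg_counts, pos_docs, neg_docs):
    """Score every term once, then assign it to the class it leans towards.

    pos_counts / neg_counts are hypothetical dicts mapping each term to the
    number of documents in that class containing it (an assumption).
    """
    pos_terms, neg_terms = [], []
    for term in set(pos_counts) | set(neg_counts):
        n11 = pos_counts.get(term, 0)   # positive docs containing the term
        n10 = neg_counts.get(term, 0)   # negative docs containing the term
        n01 = pos_docs - n11            # positive docs without the term
        n00 = neg_docs - n10            # negative docs without the term
        score = chi_squared(n11, n10, n01, n00)
        # n11 * n00 > n10 * n01 means the term occurs in a larger share of
        # positive documents than of negative ones, and vice versa.
        if n11 * n00 > n10 * n01:
            pos_terms.append((score, term))
        elif n11 * n00 < n10 * n01:
            neg_terms.append((score, term))
        # Terms with equal shares score 0 and are left out of both lists.
    return sorted(pos_terms, reverse=True), sorted(neg_terms, reverse=True)


# Toy data, invented purely for illustration.
pos, neg = split_by_class(
    {"great": 8, "bad": 1, "movie": 5},
    {"great": 2, "bad": 7, "movie": 5},
    pos_docs=10,
    neg_docs=10,
)
print(pos)
print(neg)
```

With these toy counts, "great" ends up in the positive list, "bad" in the negative list, and "movie", which is equally common in both classes, appears in neither. Calling chi_squared(8, 2, 2, 8) and chi_squared(2, 8, 8, 2) returns the same value, which is the symmetry described above.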