 

Using sparse matrices/online learning in Naive Bayes (Python, scikit)

I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, each with 150,000 features. I've tried to implement the code from the following link: Implementing Bag-of-Words Naive-Bayes classifier in NLTK

The problem (as I understand it) is that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict of labels):

Traceback (most recent call last):
  File "skitest.py", line 96, in <module>
    classif.train(add_label(matr, labels))
  File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
    for f in fs.iterkeys():
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
    return _cs_matrix.__getattr__(self, attr)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
    raise AttributeError, attr + " not found"
AttributeError: iterkeys not found

My question is: is there a way to avoid using a sparse matrix by teaching the classifier entry by entry (online learning), or is there a sparse matrix format I could use efficiently in this case instead of dok_matrix? Or am I missing something obvious?

Thanks for anyone's time. :)

EDIT, 6th sep:

Found the iterkeys, so at least the code runs. It's still too slow: it has been running for several hours on a dataset of 32k entries and still hasn't finished. Here's what I have at the moment:

import numpy as np
from collections import OrderedDict  # (or the ordereddict backport on Python 2.6)
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.classify.scikitlearn import SklearnClassifier

# lentweets and foldsize are defined elsewhere (corpus size and fold size)
matr = dok_matrix((6000000, 150000), dtype=np.float32)
labels = OrderedDict()

#collect the data into the matrix

pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# pair each training row (converted to dok) with its label
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets - foldsize)]

# train on the first lentweets - foldsize rows, then classify the rest
classif.train(add_label(matr[:(lentweets - foldsize), 0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets - foldsize)]
data = np.array(classif.batch_classify(readrow))

The problem might be that each row that is taken doesn't utilize the sparseness of the vector, but goes through each of the 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
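For illustration (not something I've tested), would slicing rows from a CSR copy help? Something like:

# untested idea: take rows from a CSR copy, where per-row slicing is cheap,
# rather than calling getrow() on the dok_matrix repeatedly
matr_csr = matr.tocsr()
train_rows = [(matr_csr.getrow(x).todok(), labels[x])
              for x in xrange(lentweets - foldsize)]
classif.train(train_rows)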

asked Aug 31 '12 by user1638859


1 Answer

Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets.(*)

If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)

You now have a training set X in CSR sparse matrix format that you can feed to a Naive Bayes classifier, provided you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
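
If the class is encoded in each file name (say, a hypothetical scheme like pos_0001.txt), building y is a one-liner:

import os

# hypothetical naming scheme "<label>_<id>.txt", e.g. "pos_0001.txt"
y = [os.path.basename(fn).split('_')[0] for fn in list_of_filenames]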

If it turns out this doesn't work because the set of documents is too large (unlikely since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
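Roughly, that out-of-core approach looks like the following sketch. The variables all_texts and all_labels are placeholders for your own data, and note the option name: recent scikit-learn spells it alternate_sign=False, while older releases such as 0.14 use non_negative=True instead.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

all_texts = [...]    # placeholder: the raw documents (e.g. tweets) as strings
all_labels = [...]   # placeholder: the corresponding class labels

# alternate_sign=False keeps the hashed counts non-negative, which MultinomialNB
# requires (on scikit-learn 0.14 use non_negative=True instead)
vect = HashingVectorizer(n_features=2 ** 20, alternate_sign=False)
nb = MultinomialNB()
all_classes = [0, 1]  # placeholder: partial_fit needs the full label set up front

def iter_minibatches(texts, labels, batch_size=1000):
    # yield the corpus in chunks so it never has to fit in memory at once
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size], labels[start:start + batch_size]

for text_batch, label_batch in iter_minibatches(all_texts, all_labels):
    X_batch = vect.transform(text_batch)   # hashing is stateless, no fitting needed
    nb.partial_fit(X_batch, label_batch, classes=all_classes)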

(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.

answered Nov 04 '22 by Fred Foo