Classifying text documents is simple with scikit-learn, but NLTK has no clean support for it, and the examples I have found do it the hard way. I want to preprocess with NLTK and classify with scikit-learn. I found SklearnClassifier in NLTK, but there is a small problem.
In scikit-learn everything is OK:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
X_train = [[0, 0], [0, 1], [1, 1]]
y_train = [('first',), ('second',), ('first', 'second')]
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train)
print(clf.classes_)
The result is ['first' 'second'],
which is what I expect. But when I try the same thing through NLTK:
from nltk.classify import SklearnClassifier
X_train = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train = [('first',), ('second',), ('first', 'second')]
clf = SklearnClassifier(OneVsRestClassifier(MultinomialNB()))
clf.train(zip(X_train, y_train))
print(clf.labels())
The result is [('first',), ('second',), ('first', 'second')]
which is not what I want: each tuple is treated as a single label instead of a set of labels. Is there any solution?
The NLTK wrapper for scikit-learn doesn't know about multilabel classification, and it shouldn't, because it doesn't implement MultiClassifierI. Implementing that would require a separate class.
You can either implement the missing functionality, or use scikit-learn without the wrapper. Newer versions of scikit-learn have a DictVectorizer that accepts roughly the same inputs that the NLTK wrapper accepts:
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
X_train_raw = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train = [('first',), ('second',), ('first', 'second')]
v = DictVectorizer()
X_train = v.fit_transform(X_train_raw)
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train)
You can then use X_test = v.transform(X_test_raw) to transform test samples to matrices. A sklearn.pipeline.Pipeline makes this easier by tying a vectorizer and a classifier together in a single object.
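As a rough sketch of the Pipeline approach, something like the following should work. Two things here are assumptions that are not part of the answer above: the test samples in X_test_raw are made up for illustration, and the labels are converted with MultiLabelBinarizer, because recent scikit-learn releases no longer accept a list of label tuples as y and expect a binary indicator matrix instead.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X_train_raw = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train = [('first',), ('second',), ('first', 'second')]

# Assumption: recent scikit-learn versions want a binary indicator matrix
# for multilabel targets, so the label tuples are converted first.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

# The pipeline vectorizes the feature dicts and then fits the classifier,
# so the same object handles both training and test samples.
pipe = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('classifier', OneVsRestClassifier(MultinomialNB())),
])
pipe.fit(X_train_raw, Y_train)

# Hypothetical test samples, in the same dict format as the training data.
X_test_raw = [{'a': 1}, {'b': 1, 'c': 1}]
Y_pred = pipe.predict(X_test_raw)

# Map the indicator rows back to tuples of label names.
predicted_labels = mlb.inverse_transform(Y_pred)
print(mlb.classes_)
print(predicted_labels)
```

Here mlb.classes_ plays the role that clf.classes_ had in the question, and inverse_transform turns each predicted row back into a tuple of label names.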
Disclaimer: according to the FAQ, I should disclose my affiliation. I wrote both DictVectorizer and the NLTK wrapper for scikit-learn.