
Use scikit-learn to classify into multiple categories

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match.

For example I have a piece of text:

"Theaters in New York compared to those in London" 

And I have trained the algorithm to pick a place for every text snippet I feed it.

In the above example I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even return the label with the next highest probability?

Thanks for your help.

---Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using:

    y_train = ('New York', 'London')
    train_set = ("new york nyc big apple", "london uk great britain")
    vocab = {'new york': 0, 'nyc': 1, 'big apple': 2,
             'london': 3, 'uk': 4, 'great britain': 5}
    count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),
                            vocabulary=vocab)
    test_set = ('nice day in nyc', 'london town',
                'hello welcome to the big apple. enjoy it here and london too')

    X_vectorized = count.transform(train_set).todense()
    smatrix2 = count.transform(test_set).todense()

    base_clf = MultinomialNB(alpha=1)

    clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
    Y_pred = clf.predict(smatrix2)
    print Y_pred

Result: ['New York' 'London' 'London']

CodeMonkeyB asked May 10 '12 01:05




2 Answers

What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html.

I'm not sure what's going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it's a question of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples/lists of labels, one entry per training text.
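To illustrate the point about the target format: a flat sequence with one string per sample is read as an ordinary single-label problem, which would explain why predict gives back exactly one label per text. The sketch below is only an illustration, not code from the question or this answer. It assumes a recent scikit-learn release, where a multi-label target is passed as a binary indicator matrix (built here with MultiLabelBinarizer) rather than as a list of label lists. The variable names are made up for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

# One plain string per sample: scikit-learn sees a single-label target.
flat_labels = ["New York", "London"]
flat_kind = type_of_target(flat_labels)

# One list of labels per sample, encoded as a 0/1 indicator matrix
# with one column per distinct label (columns follow mlb.classes_).
label_sets = [["New York"], ["London"], ["New York", "London"]]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)
indicator_kind = type_of_target(Y)

print(flat_kind)
print(list(mlb.classes_))
print(Y)
print(indicator_kind)
```

Once the target is recognized as a multi-label indicator matrix, OneVsRestClassifier fits one binary classifier per column, so predict can switch on several labels for the same text.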

The following works for me:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    classifier = Pipeline([
        ('vectorizer', CountVectorizer(min_n=1,max_n=2)),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    for item, labels in zip(X_test, predicted):
        print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

For me, this produces the output:

    nice day in nyc => New York
    welcome to london => London
    hello welcome to new york. enjoy it here and london too => New York, London

Hope this helps.

mwv answered Oct 13 '22 04:10



EDIT: Updated for Python 3, scikit-learn 0.18.1 using MultiLabelBinarizer as suggested.

I've been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as the input rather than binary labels and encodes them using MultiLabelBinarizer.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["london"],["london"],["london"],["london"],
                    ["london"],["london"],["new york","london"],["new york","london"]]

    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y_train_text)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    all_labels = mlb.inverse_transform(predicted)

    for item, labels in zip(X_test, all_labels):
        print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

    nice day in nyc => new york
    welcome to london => london
    london is rainy => london
    it is raining in britian => london
    it is raining in britian and the big apple => new york
    it is raining in britian and nyc => london, new york
    hello welcome to new york. enjoy it here and london too => london, new york
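The question also asked whether the label with the next highest score can be returned. One possible approach, sketched here under assumptions rather than taken from either answer: OneVsRestClassifier exposes decision_function, which returns one signed margin per label, and sorting those margins ranks the labels for a given text. The margins from LinearSVC are not probabilities; for probabilities the base estimator would need predict_proba (LogisticRegression, for example). The training sentences are the ones used above, and the other variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

train_texts = ["new york is a hell of a town",
               "new york was originally dutch",
               "the big apple is great",
               "new york is also called the big apple",
               "nyc is nice",
               "people abbreviate new york city as nyc",
               "the capital of great britain is london",
               "london is in the uk",
               "london is in england",
               "london is in great britain",
               "it rains a lot in london",
               "london hosts the british museum",
               "new york is great and so is london",
               "i like london better than new york"]
train_labels = ([["new york"]] * 6 + [["london"]] * 6
                + [["new york", "london"]] * 2)

# Encode the label lists as a 0/1 indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC()))])
model.fit(train_texts, Y)

# One row per input text, one column per label (same order as mlb.classes_).
scores = model.decision_function(["nice day in nyc"])

# Sort label indices from highest to lowest margin.
order = np.argsort(scores[0])[::-1]
ranked = [(str(mlb.classes_[i]), float(scores[0, i])) for i in order]

for label, score in ranked:
    print("{0}: {1:.3f}".format(label, score))
```

The first entry of ranked is the best-scoring label and the second is the runner-up, whether or not its margin is above zero (the threshold predict uses to switch a label on).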
J Maurer answered Oct 13 '22 04:10
