Use scikit-learn to classify into multiple categories

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match.

For example I have a piece of text:

"Theaters in New York compared to those in London"

And I have trained the algorithm to pick a place for every text snippet I feed it.

In the above example I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even return the label with the next highest probability?

Thanks for your help.

---Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using:

    y_train = ('New York', 'London')

    train_set = ("new york nyc big apple", "london uk great britain")
    vocab = {'new york': 0, 'nyc': 1, 'big apple': 2,
             'london': 3, 'uk': 4, 'great britain': 5}
    count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),
                            vocabulary=vocab)
    test_set = ('nice day in nyc', 'london town',
                'hello welcome to the big apple. enjoy it here and london too')

    X_vectorized = count.transform(train_set).todense()
    smatrix2 = count.transform(test_set).todense()

    base_clf = MultinomialNB(alpha=1)

    clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
    Y_pred = clf.predict(smatrix2)
    print Y_pred

Result: ['New York' 'London' 'London']


asked May 10 '12 01:05

CodeMonkeyB

2 Answers

What you want is called multi-label classification, and scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html.

I'm not sure what's going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it's a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples/lists of labels, one per training sample.
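To make the point about the target format concrete, here is a minimal sketch. It assumes a scikit-learn release that ships `MultiLabelBinarizer` and `type_of_target` (neither is used in the question's code), and the label lists are made up for illustration. It shows that one label per sample is treated as a single-label problem, while a list of label lists can be turned into a 0/1 indicator matrix that is recognised as multi-label.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

# One label per sample, shaped like the question's y_train:
# scikit-learn sees this as a single-label target.
single_label = ["New York", "London"]
print(type_of_target(single_label))

# One list of labels per sample (illustrative data): a sample may carry
# one label or several.
label_sets = [["New York"], ["London"], ["New York", "London"]]

# Encode the label lists as a 0/1 indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(label_sets)

print(mlb.classes_)               # column order of the indicator matrix
print(indicator)
print(type_of_target(indicator))
```

With a single-label target, `predict` can only ever return one label per text, which would be consistent with the behaviour described in the question.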

The following works for me:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    # One list of label indices per training text (indices into target_names);
    # the last two texts carry both labels.
    y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    classifier = Pipeline([
        # Unigrams and bigrams. min_n/max_n is the API of the scikit-learn
        # version used here; later releases spell this ngram_range=(1, 2).
        ('vectorizer', CountVectorizer(min_n=1,max_n=2)),
        ('tfidf', TfidfTransformer()),
        # One binary LinearSVC per label.
        ('clf', OneVsRestClassifier(LinearSVC()))])
    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    for item, labels in zip(X_test, predicted):
        print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

For me, this produces the output:

    nice day in nyc => New York
    welcome to london => London
    hello welcome to new york. enjoy it here and london too => New York, London

Hope this helps.

answered Oct 13 '22 04:10

mwv

EDIT: Updated for Python 3, scikit-learn 0.18.1 using MultiLabelBinarizer as suggested.

I've been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as the input rather than binary labels and encodes them using MultiLabelBinarizer.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["london"],["london"],["london"],["london"],
                    ["london"],["london"],["new york","london"],["new york","london"]]

    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y_train_text)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    all_labels = mlb.inverse_transform(predicted)

    for item, labels in zip(X_test, all_labels):
        print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

    nice day in nyc => new york
    welcome to london => london
    london is rainy => london
    it is raining in britian => london
    it is raining in britian and the big apple => new york
    it is raining in britian and nyc => london, new york
    hello welcome to new york. enjoy it here and london too => london, new york
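The question also asks about getting the label with the next highest probability. Neither answer covers that, so what follows is only a sketch under stated assumptions: it reuses the same training data and pipeline, and ranks the labels for each text by the per-label score from `decision_function`. `LinearSVC` does not provide probabilities, so these scores are signed distances from each label's decision boundary, not probabilities. A base estimator with `predict_proba` (for example `LogisticRegression`) would be one way to get probability-like values instead.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

X_train = ["new york is a hell of a town",
           "new york was originally dutch",
           "the big apple is great",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "the capital of great britain is london",
           "london is in the uk",
           "london is in england",
           "london is in great britain",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
y_train_text = ([["new york"]] * 6 + [["london"]] * 6
                + [["new york", "london"]] * 2)
X_test = ["nice day in nyc", "welcome to london"]

# Encode the label lists as a 0/1 indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)

# One score per (text, label); columns follow the order of mlb.classes_.
scores = classifier.decision_function(X_test)

# Sort label indices by descending score for each text.
order = np.argsort(-scores, axis=1)
for text, row in zip(X_test, order):
    ranked = [mlb.classes_[i] for i in row]
    print("{0} => {1}".format(text, ", ".join(ranked)))
```

The first label in each printed row is the best-scoring one and the second is the runner-up, regardless of whether `predict` would have returned it.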

answered Oct 13 '22 04:10

J Maurer

Tags: python, classification, scikit-learn
