 

How to get Top 3 or Top N predictions using sklearn's SGDClassifier

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model

arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.2
Y = ['animals', 'fruits', 'elements', 'chemicals']
T = ["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log')  # renamed to loss='log_loss' in newer scikit-learn
clf.fit(X, Y)
x = clf.predict(test)
print(x)  # prints: ['elements']

In the above code, clf.predict() returns only the single best prediction for a sample. I am interested in the top 3 predictions for a particular sample. I know that predict_proba/predict_log_proba return a list of probabilities for each label in list Y, but those have to be sorted and then associated with the labels in Y before you get the top 3 results. Is there a more direct and efficient way?
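For reference, a quick sketch of what those probabilities look like (using clf and test from the snippet above); the columns returned by predict_proba are in the same order as clf.classes_:

probs = clf.predict_proba(test)
print(clf.classes_)  # sorted label order, e.g. ['animals' 'chemicals' 'elements' 'fruits']
print(probs)         # one row per test sample, one probability per class, same column order as clf.classes_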

asked Sep 08 '15 by Pranay Mathur


People also ask

What does predict () function of Sklearn do?

Scikit-learn provides tools for training and evaluating machine learning models. Once a model is trained, the predict() method applies it to new input data and returns the predicted output value (for a classifier, the predicted class label).

How does predict proba work?

predict_proba applies the model to the given dataset and returns, for each sample, the probability that it belongs to each of the known classes. The results are only meaningful if the input data contains all the features that were used to train the model.

What is the difference between predict and predict_proba?

The predict method returns the actual predicted class, while predict_proba returns the class probabilities (i.e. the probability that a particular data point falls into each of the underlying classes).


4 Answers

There is no built-in function, but what is wrong with

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[-n:]

?

As suggested in one of the comments, [-n:] should be changed to [:, -n:]:

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:,-n:]
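If you also need the matching class labels (a small sketch, not part of the original answer): the columns of predict_proba follow clf.classes_, so the argsort indices map straight back to labels, and reversing puts the highest probability first.

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:, -n:][:, ::-1]  # top-n column indices, highest probability first
top_labels = clf.classes_[best_n]                    # same shape, as class labels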
answered Oct 01 '22 by Andreas Mueller


I know this has been answered...but I can add a bit more...

# both preds and truths have shape (m, n): m predictions, n classes
import numpy as np

def top_n_accuracy(preds, truths, n):
    best_n = np.argsort(preds, axis=1)[:, -n:]   # indices of the n highest-scoring classes per row
    ts = np.argmax(truths, axis=1)               # index of the true class per row
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i, :]:
            successes += 1
    return float(successes) / ts.shape[0]

It's quick and dirty, but I find it useful. One can add their own error checking, etc.
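A possible usage sketch (not from the original answer; label_binarize is just one way to one-hot encode the true labels so truths has the same shape as preds, and clf/test are taken from the question):

from sklearn.preprocessing import label_binarize

probs = clf.predict_proba(test)                            # shape (m, n_classes)
truths = label_binarize(['fruits'], classes=clf.classes_)  # shape (m, n_classes), one-hot
print(top_n_accuracy(probs, truths, n=3))                  # 1.0 if 'fruits' is among the top 3 for the sample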

answered Oct 01 '22 by user1269942


Hopefully, Andreas will help with this. predict_proba is not available when loss='hinge'. To get the top n classes when loss='hinge', wrap the classifier in CalibratedClassifierCV:

from sklearn.calibration import CalibratedClassifierCV

calibrated_clf = CalibratedClassifierCV(clfSDG, cv=3, method='sigmoid')
model = calibrated_clf.fit(train.data, train.label)

probs = model.predict_proba(test_data)
sorted(zip(calibrated_clf.classes_, probs[0]), key=lambda x: x[1])[-n:]

Not sure if clfSDG.predict and calibrated_clf.predict will always predict the same class.
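For completeness, a self-contained sketch of that approach on synthetic data (the dataset and variable values here are hypothetical, not from the original answer):

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X_train, y_train = make_classification(n_samples=300, n_classes=4,
                                        n_informative=6, random_state=0)

clfSDG = SGDClassifier(loss='hinge', random_state=0)                    # hinge loss: no predict_proba on its own
calibrated_clf = CalibratedClassifierCV(clfSDG, cv=3, method='sigmoid')
calibrated_clf.fit(X_train, y_train)

n = 3
probs = calibrated_clf.predict_proba(X_train[:1])                       # calibrated probabilities
top_n = sorted(zip(calibrated_clf.classes_, probs[0]), key=lambda t: t[1])[-n:]
print(top_n)  # n (label, probability) pairs, lowest to highest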

answered Oct 01 '22 by valearner


argsort gives results in ascending order. If you want to save yourself some unusual loops or confusion, you can use a simple trick:

probs = clf.predict_proba(test)
best_n = np.argsort(-probs, axis=1)[:, :n]

Negating the probabilities turns the largest values into the smallest, so the first n columns of the argsort give the top-n results in descending order.
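If you also want the probabilities themselves alongside the indices, np.take_along_axis can pick them out (a small sketch assuming probs, best_n, n, and clf from above):

top_probs = np.take_along_axis(probs, best_n, axis=1)  # probabilities of the top-n classes, descending
top_labels = clf.classes_[best_n]                      # the matching class labels
print(list(zip(top_labels[0], top_probs[0])))          # (label, probability) pairs for the first sample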

answered Oct 01 '22 by Gaurav Singhal