from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model

arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
Y = ['animals', 'fruits', 'elements', 'chemicals']
T = ["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log')  # use loss='log_loss' on scikit-learn >= 1.1
clf.fit(X, Y)
x = clf.predict(test)
print(x)  # ['elements']
In the above code, clf.predict() returns only the single best prediction for each sample. I am interested in the top 3 predictions for a particular sample. I know that predict_proba/predict_log_proba returns the probabilities for every class in list Y, but that list has to be sorted and then associated with the class labels in Y before you get the top 3 results.
Is there any direct and efficient way?
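For reference, the manual approach I am describing would look something like this (a sketch; choosing n = 3 and pairing the probabilities with clf.classes_, which holds the label for each probability column, are my additions):

probs = clf.predict_proba(test)[0]
top3 = sorted(zip(clf.classes_, probs), key=lambda p: p[1], reverse=True)[:3]
print(top3)  # [(label, probability), ...] with the most probable class first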
Some background on scikit-learn's prediction API: scikit-learn provides tools for training and evaluating machine learning models, and, once a model is trained, for predicting output values. The predict method returns the single predicted class for each sample, while predict_proba can be used to infer the class probabilities (i.e. the probability that a particular data point falls into each of the underlying classes). The prediction is only meaningful if the input data contains all the features the model was trained on.
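A minimal sketch of the difference, using the classifier from the question (the printed numbers are illustrative):

pred = clf.predict(test)         # single best class per sample
probs = clf.predict_proba(test)  # shape (1, 4): one probability per class
print(clf.classes_)              # column order of probs: ['animals' 'chemicals' 'elements' 'fruits']
print(pred)                      # e.g. ['elements']
print(probs)                     # e.g. [[0.05 0.10 0.60 0.25]]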
There is no built-in function, but what is wrong with
probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[-n:]
As suggested in one of the comments, [-n:] should be changed to [:,-n:] so the slice takes the last n columns (classes) of each row rather than the last n rows:
probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:,-n:]
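best_n holds column indices into the probability matrix, still in ascending order of probability. To turn them into labels, you can index clf.classes_ (a sketch, assuming the classifier from the question and n as above; reversing so the most probable class comes first):

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:, -n:]
top_labels = clf.classes_[best_n[:, ::-1]]  # most probable class first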
I know this has been answered...but I can add a bit more...
# both preds and truths are the same shape, m by n
# (m is the number of predictions, n is the number of classes);
# truths is assumed to be one-hot encoded
def top_n_accuracy(preds, truths, n):
    # column indices of the n highest-scoring classes per sample
    best_n = np.argsort(preds, axis=1)[:, -n:]
    # index of the true class per sample
    ts = np.argmax(truths, axis=1)
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i, :]:
            successes += 1
    return float(successes) / ts.shape[0]
It's quick and dirty, but I find it useful. One can add their own error checking, etc.
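For example (toy arrays of my own, just to illustrate the expected shapes):

preds = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3]])
truths = np.array([[0, 0, 1],   # true class is 2
                   [1, 0, 0]])  # true class is 0
print(top_n_accuracy(preds, truths, 2))  # 1.0, since each true class is in its top 2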
Hopefully, Andreas will help with this. predict_proba is not available when loss='hinge'. To get the top n classes when loss='hinge', wrap the classifier in CalibratedClassifierCV:

from sklearn.calibration import CalibratedClassifierCV

calibrated_clf = CalibratedClassifierCV(clfSDG, cv=3, method='sigmoid')
model = calibrated_clf.fit(train.data, train.label)
probs = model.predict_proba(test_data)
sorted(zip(calibrated_clf.classes_, probs[0]), key=lambda x: x[1])[-n:]
Not sure if clfSDG.predict and calibrated_clf.predict will always predict the same class.
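A quick way to check that concern empirically (a sketch; it assumes clfSDG was also fitted on the same training data):

# fraction of test samples on which the raw and calibrated classifiers agree;
# calibration rescales the per-class scores, so the argmax can change
# in multiclass settings
agreement = np.mean(clfSDG.predict(test_data) == calibrated_clf.predict(test_data))
print(agreement)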
argsort gives results in ascending order; if you want to save yourself from reverse-slicing or awkward loops, you can use a simple trick: negate the probabilities.
probs = clf.predict_proba(test)
best_n = np.argsort(-probs, axis=1)[:, :n]
Negating the probabilities turns the largest values into the smallest, so the first n indices of each row are the top-n results in descending order of probability.
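Combined with clf.classes_, this gives the labels directly, most probable first (a sketch using the question's classifier):

probs = clf.predict_proba(test)
top_labels = clf.classes_[np.argsort(-probs, axis=1)[:, :n]]  # n most probable labels per sample, best first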