
Probability prediction method of KNeighborsClassifier returns only 0 and 1

Can anyone tell me what the problem with my code is? Why can I predict probabilities on the iris dataset using LogisticRegression, while KNeighborsClassifier only gives me 0 or 1 instead of fractional results like the ones LogisticRegression yields?

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
X = iris.data
y = iris.target

# Assumed stratified 10-fold split (each test fold then has 15 samples,
# matching the 15 probabilities printed below)
skf = StratifiedKFold(n_splits=10).split(X, y)
for train_index, test_index in skf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

from sklearn.linear_model import LogisticRegression
ln = LogisticRegression()
ln.fit(X_train, y_train)

# Predicted probability of class 1 for each test point
ln.predict_proba(X_test)[:, 1]

array([ 0.18075722, 0.08906078, 0.14693156, 0.10467766, 0.14823032, 0.70361962, 0.65733216, 0.77864636, 0.67203114, 0.68655163, 0.25219798, 0.3863194 , 0.30735105, 0.13963637, 0.28017798])

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)

# The same class-1 probabilities from KNN, for the first 10 test points
knn.predict_proba(X_test)[0:10, 1]

array([ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])

Asked by Kasra Babaei on May 07 '16 at 13:05

People also ask

What should be the default value of n_neighbors?

KNeighborsClassifier uses 5 as the default value for n_neighbors (otherwise known as k), but this can easily be optimized using something like k-fold cross-validation to try out different values of k and determine the best choice.
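
A minimal sketch of that tuning loop, assuming 10-fold cross-validation on the iris data (the range of candidate k values here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean cross-validated accuracy for each candidate k
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
          for k in range(1, 31)}

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])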

Why doesn't k=1 in KNN give the best accuracy?

When k=1 you estimate the probability based on a single sample: your closest neighbour. This is very sensitive to all sorts of distortions such as noise, outliers, and mislabelled data. By using a higher value of k, you tend to be more robust against those distortions.
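
A toy sketch of that effect, assuming we deliberately mislabel 15 of the 150 iris samples (the noise injection and the two k values are purely illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Overwrite the labels of 15 random samples to simulate label noise
rng = np.random.RandomState(0)
y_noisy = y.copy()
flip = rng.choice(len(y), size=15, replace=False)
y_noisy[flip] = rng.randint(0, 3, size=15)

# k=1 chases individual (possibly noisy) labels; a larger k averages over them
for k in (1, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y_noisy, cv=5).mean()
    print(k, round(acc, 3))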

How do you use KNeighborsClassifier in Python?

First, import the KNeighborsClassifier module and create a KNN classifier object, passing the number of neighbours as an argument to KNeighborsClassifier(). Then fit the model on the training set using fit() and make predictions on the test set using predict().
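
A compact version of that workflow, assuming a simple hold-out split (the split parameters are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # number of neighbours passed as an argument
knn.fit(X_train, y_train)                  # fit on the training set
print(knn.predict(X_test)[:5])             # predict on the test set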


1 Answer

Because KNN has a very limited concept of probability: its estimate is simply the fraction of votes among the nearest neighbours. Increase the number of neighbours to 15 or 100, or query a point near the decision boundary, and you will see more diverse results. Currently each of your test points simply has 5 neighbours of the same label (hence a probability of 0 or 1).
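
A quick sketch of that effect, assuming a stratified hold-out split (the split parameters are arbitrary). With 50 neighbours the vote is usually split across classes, so predict_proba returns fractions of 50 rather than only 0 and 1:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Each estimated probability is now (votes for the class) / 50
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
print(knn.predict_proba(X_test)[:5])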

Answered by lejlot on Jan 01 '23 at 17:01