Can anyone tell me what the problem with my code is? Why can I predict probabilities for the iris dataset with LogisticRegression, while KNeighborsClassifier gives me only 0 or 1 instead of a result like the one LogisticRegression yields?
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

iris = load_iris()
X = iris.data
y = iris.target

# stratified 10-fold split; the last fold's train/test indices are used below
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

ln = LogisticRegression()
ln.fit(X_train, y_train)
ln.predict_proba(X_test)[:,1]
array([ 0.18075722, 0.08906078, 0.14693156, 0.10467766, 0.14823032, 0.70361962, 0.65733216, 0.77864636, 0.67203114, 0.68655163, 0.25219798, 0.3863194 , 0.30735105, 0.13963637, 0.28017798])
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)
knn.predict_proba(X_test)[0:10,1]
array([ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])
Because KNN has a very limited concept of probability: its estimate is simply the fraction of votes among the nearest neighbours. Currently each of your query points has 5 neighbours with the same label, hence a probability of exactly 0 or 1. Increase the number of neighbours to 15 or 100, or query a point near the decision boundary, and you will see more diverse results.
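For instance, a minimal sketch (assuming the same X_train/X_test split as in the question) that only changes n_neighbors:

from sklearn.neighbors import KNeighborsClassifier

# with 15 neighbours the vote fractions can take any value k/15,
# so the estimates are no longer forced to exactly 0 or 1
knn15 = KNeighborsClassifier(n_neighbors=15, algorithm='ball_tree', metric='euclidean')
knn15.fit(X_train, y_train)
print(knn15.predict_proba(X_test)[:, 1])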
KNeighborsClassifier uses 5 as the default value for n_neighbors (otherwise known as k), but this can easily be tuned with something like K-fold cross-validation, trying out different values of k and keeping the best one.
When k=1 you estimate the probability from a single sample: your closest neighbour. This is very sensitive to all sorts of distortions such as noise, outliers, mislabelled data, and so on. By using a higher value of k, you tend to be more robust against those distortions.
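As an illustration of the cross-validation suggestion above, a rough sketch (the candidate values of k here are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# try several values of k and keep the one with the best cross-validated accuracy
for k in (1, 5, 15, 25):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(k, scores.mean())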
First, import the KNeighborsClassifier module and create a KNN classifier object, passing the number of neighbors as an argument to KNeighborsClassifier(). Then fit the model on the training set with fit() and make predictions on the test set with predict().
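For example, a minimal end-to-end sketch of that workflow (the 70/30 split and k=5 are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

# create the classifier with the number of neighbors passed as an argument
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)           # fit on the training set
y_pred = knn.predict(X_test)        # predict on the test set
print(metrics.accuracy_score(y_test, y_pred))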