I'm using Python sklearn (version 0.17) to select the ideal model on a data set. To do this, I followed these steps:
1. Split the data with cross_validation.train_test_split and test_size = 0.2.
2. Used GridSearchCV to select the ideal k-nearest-neighbors classifier on the training set.
3. Passed the classifier returned by GridSearchCV to plot_learning_curve; plot_learning_curve gave the plot shown below.
4. Ran the classifier returned by GridSearchCV on the test set obtained in step 1.
From the plot, we can see that the score for the max. training size is about 0.43. This score is the score returned by the sklearn.learning_curve.learning_curve function.
But when I run the best classifier on the test set I get an accuracy score of 0.61, as returned by sklearn.metrics.accuracy_score (correctly predicted labels / number of labels).
Link to image: (learning curve plot)
This is the code I am using. I have not included the plot_learning_curve function as it would take a lot of space; I took plot_learning_curve from here.
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
import sys
from sklearn import cross_validation
from sklearn.learning_curve import learning_curve
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
filename = sys.argv[1]
data = np.loadtxt(fname = filename, delimiter = ',')
X = data[:, 0:-1]
y = data[:, -1] # last column is the label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
params = {'n_neighbors': [2, 3, 5, 7, 10, 20, 30, 40, 50],
          'weights': ['uniform', 'distance']}
clf = GridSearchCV(KNeighborsClassifier(), param_grid=params)
clf.fit(X_train, y_train)
y_true, y_pred = y_test, clf.predict(X_test)
acc = accuracy_score(y_pred, y_test)
print 'accuracy on test set =', acc
print clf.best_params_
for params, mean_score, scores in clf.grid_scores_:
    print "%0.3f (+/-%0.03f) for %r" % (
        mean_score, scores.std() / 2, params)
y_true, y_pred = y_test, clf.predict(X_test)
#pred = clf.predict(np.array(features_test))
acc = accuracy_score(y_pred, y_test)
print classification_report(y_true, y_pred)
print 'accuracy last =', acc
print
plot_learning_curve(clf, "KNeighborsClassifier",
X, y,
train_sizes=np.linspace(.05, 1.0, 5))
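For reference, a minimal stand-in for the omitted plot_learning_curve helper could look roughly like the sketch below; it is loosely modeled on the scikit-learn "Plotting Learning Curves" example, not the exact function used above.
# Minimal sketch of a plot_learning_curve helper (not the exact one used above).
import numpy as np
from matplotlib import pyplot as plt
from sklearn.learning_curve import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    # learning_curve runs cross-validation for each training size and returns
    # per-fold scores with shape (n_sizes, n_cv_folds).
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes)
    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.plot(sizes, train_scores.mean(axis=1), 'o-', label="Training score")
    plt.plot(sizes, test_scores.mean(axis=1), 'o-', label="Cross-validation score")
    plt.legend(loc="best")
    plt.show()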
Is this normal? I can understand there might be some difference in scores but this is a difference of 0.18, which when converted to percentages is 43% against 61%. The classification_report also gives an average 0.61 recall.
Am I doing something wrong? Is there a difference in how learning_curve calculates scores? I also tried passing scoring='accuracy' to the learning_curve function to see if it matches the accuracy score, but it didn't make any difference.
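For reference, passing the scorer to learning_curve directly looks roughly like this (a sketch; in my code the actual call happens inside plot_learning_curve):
# Sketch: call learning_curve directly and inspect the raw CV scores.
sizes, train_scores, test_scores = learning_curve(
    clf, X, y, train_sizes=np.linspace(.05, 1.0, 5), scoring='accuracy')
# test_scores has one row per training size and one column per CV fold;
# the plotted curve is the per-row mean.
print 'mean CV accuracy per training size:', test_scores.mean(axis=1)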
Any advice would be greatly helpful.
I'm using the wine quality (white) data set from UCI, and I removed the header before running the code.
When you call the learning_curve function it performs cross-validation on your whole data. As you are leaving the cv parameter empty, it is a 3-fold cross-validation splitting strategy. And here comes the tricky part, because as stated in the documentation, "If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used". And your estimator is a classifier.
So, what's the difference between KFold and StratifiedKFold?
KFold = Split dataset into k consecutive folds (without shuffling by default)
StratifiedKFold = "The folds are made by preserving the percentage of samples for each class."
Let's make a simple example:
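A sketch along those lines (the toy label vector y_toy below is made up purely for illustration):
import numpy as np
from sklearn.cross_validation import KFold, StratifiedKFold

# Toy labels that are perfectly "structured": all of class 0 first, then all of class 1.
y_toy = np.array([0] * 10 + [1] * 10)

print 'KFold (consecutive folds, no shuffling):'
for train_idx, test_idx in KFold(len(y_toy), n_folds=2):
    print '  train labels:', y_toy[train_idx], '| test labels:', y_toy[test_idx]

print 'StratifiedKFold (class percentages preserved in each fold):'
for train_idx, test_idx in StratifiedKFold(y_toy, n_folds=2):
    print '  train labels:', y_toy[train_idx], '| test labels:', y_toy[test_idx]
With KFold, each fold contains only one of the two classes, so a classifier trained on one fold scores close to 0 on the other; StratifiedKFold puts both classes in every fold.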
This explains the low accuracy (~0.43) you get when drawing your learning curve. Of course this is an extreme example to illustrate the situation, but your data is somehow structured and you need to shuffle it.
But when you obtain the ~61% accuracy, you have split the data with the train_test_split method, which by default shuffles the data and (roughly) keeps the class proportions.
Just look at this, I've performed a simple test to support my hypothesis:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0., random_state=2)
In your example you fed learning_curve with all of your data, X, y. I'm doing a little trick here, which is to split the data with test_size=0., meaning all of the data ends up in the train variables. This way I'm still keeping all the data, but it is now shuffled because it went through the train_test_split function.
Then I've called your plotting function but with shuffled data:
plot_learning_curve(clf, "KNeighborsClassifier",X_train2, y_train2, train_sizes=np.linspace(.05, 1.0, 5))
Now the output for the maximum number of training samples is 0.59 instead of 0.43, which makes a lot more sense given your GridSearch results.
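The same effect can be had without the test_size=0. trick; as a sketch (neither of these calls is in your original code), you can shuffle the rows once with sklearn.utils.shuffle, or hand learning_curve an explicitly shuffled splitter:
import numpy as np
from sklearn.utils import shuffle
from sklearn.cross_validation import StratifiedKFold
from sklearn.learning_curve import learning_curve

# Option 1: shuffle the rows once, then plot the curve on the shuffled copies.
X_shuf, y_shuf = shuffle(X, y, random_state=2)
plot_learning_curve(clf, "KNeighborsClassifier", X_shuf, y_shuf,
                    train_sizes=np.linspace(.05, 1.0, 5))

# Option 2: keep the data order but give learning_curve a shuffled CV splitter.
cv = StratifiedKFold(y, n_folds=3, shuffle=True, random_state=2)
sizes, train_scores, test_scores = learning_curve(
    clf, X, y, cv=cv, train_sizes=np.linspace(.05, 1.0, 5))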
Observation: I think the whole point of plotting the learning curve is to determine whether adding more samples to the training set allows our estimator to perform better or not (so you can decide, for example, when there is no need to add more examples). Since in train_sizes you just feed the values np.linspace(.05, 1.0, 5) --> [0.05, 0.2875, 0.525, 0.7625, 1.], I'm not entirely sure that this is the usage you are pursuing in this kind of test.