I'm trying to build my first KNN classifier with scikit-learn. I've been following the User Guide and other online examples, but there are a few things I'm unsure about. For this post, let's use the following:
X = data, Y = target
1) Most introductions to machine learning that I've read say you want a training set, a validation set, and a test set. As I understand it, cross-validation lets you combine the training and validation sets to train the model, and then you should test it on the test set to get a score. However, I have seen papers where the authors simply cross-validate on the entire dataset and report the CV score as the accuracy. I understand that in an ideal world you would test on separate data, but if this is legitimate I would like to cross-validate on my entire dataset and report those scores.
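Concretely, this is the kind of thing I mean, reporting the mean and spread of the fold scores (cv=5 is just an example):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(algorithm='brute')
scores = cross_val_score(knn, X, Y, cv=5)  # 5 accuracy scores, one per fold
print(scores.mean(), scores.std())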
2) So starting the process
I define my KNN Classifier as follows
knn = KNeighborsClassifier(algorithm = 'brute')
I search for the best n_neighbors using
clf = GridSearchCV(knn, parameters, cv=5)
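where parameters is a grid over n_neighbors, something like this (the exact range is just a placeholder):

parameters = {'n_neighbors': range(1, 31)}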
Now if I say
clf.fit(X,Y)
I can check the best parameter using
clf.best_params_
and then I can get a score
clf.score(X,Y)
But, as I understand it, this hasn't cross-validated the model, since it only gives one score?
If clf.best_params_ tells me the best n_neighbors is 14, could I then go on with
knn2 = KNeighborsClassifier(n_neighbors = 14, algorithm='brute')
cross_val_score(knn2, X, Y, cv=5)
Now I know the data has been cross-validated, but I don't know if it is legitimate to use clf.fit to find the best parameter and then use cross_val_score with a new KNN model?
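(I suppose I could also pull the parameter straight from the search instead of hard-coding 14; this is just my guess at the tidier version:)

knn2 = KNeighborsClassifier(algorithm='brute', **clf.best_params_)
cross_val_score(knn2, X, Y, cv=5)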
3) I understand that the 'proper' way to do it would be as follows
Split into X_train, X_test, Y_train, Y_test; scale the train set, then apply the fitted transform to the test set
knn = KNeighborsClassifier(algorithm = 'brute')
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_train,Y_train)
clf.best_params_
and then I can get a score
clf.score(X_test,Y_test)
In this case, is the score calculated using the best parameter?
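Put together, my understanding of that recipe is something like the following sketch (train_test_split and StandardScaler are my assumptions for the split/scale step):

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# hold out a test set, then scale using statistics from the train set only
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(algorithm='brute')
parameters = {'n_neighbors': range(1, 31)}  # placeholder grid
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_train, Y_train)
print(clf.best_params_)
print(clf.score(X_test, Y_test))  # scored with the refit best estimator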
I hope that this makes sense. I've been trying to find as much as I can without posting but I have come to the point where I think it would be easier to get some direct answers.
In my head, I am trying to get cross-validated scores using the whole dataset, but also use a grid search (or something similar) to fine-tune the parameters.
Thanks in advance
Yes, cross-validating on your entire dataset is viable, but I still suggest you at least split your data into two sets: one for cross-validation and one for testing.
According to the documentation, the .score function returns a single float value, which is the score of the best estimator (the best-scoring estimator you get from fitting your GridSearchCV) on the given X, Y.
Hope that makes things clearer :)
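For example, assuming you have held out X_test, Y_test as suggested, you can compare that single refit score with proper cross-validated scores for the chosen model (best_estimator_ is the refit estimator GridSearchCV exposes):

from sklearn.model_selection import cross_val_score

print(clf.score(X_test, Y_test))  # one float: the best estimator scored on the test set
scores = cross_val_score(clf.best_estimator_, X_train, Y_train, cv=5)
print(scores.mean())  # cross-validated estimate for the selected hyperparameters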
If the dataset is small, you may not have the luxury of a train/test split, and people often estimate the predictive power of a model solely from cross-validation. In your code above, GridSearchCV performs 5-fold cross-validation when you fit the model (clf.fit(X, y)) by splitting your training set into an inner training set (80%) and a validation set (20%).
You can access the model's performance metrics, including the validation scores, through clf.cv_results_. The metrics you want to look at include mean_test_score (in your case, you should have one score per n_neighbors value). You may also want to look at mean_train_score (enable it with return_train_score=True) to get a sense of whether the model is overfitting. See the sample code below for the model setup. (Note: KNN is a non-parametric model that is sensitive to the scale of the features, so people often standardize them with StandardScaler.)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scaling inside the pipeline means each CV fold is standardized
# using only its own training folds
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsClassifier(algorithm='brute'))
])
params = {
    'knn__n_neighbors': [3, 5, 7, 9, 11]  # usually odd numbers
}
clf = GridSearchCV(estimator=pipe,
                   param_grid=params,
                   cv=5,
                   return_train_score=True)  # turn on cv train scores
clf.fit(X, y)
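Once fitted, you can pull the per-parameter scores out of cv_results_; a rough sketch (pandas is only used here for readable printing):

import pandas as pd

cv_results = pd.DataFrame(clf.cv_results_)
print(cv_results[['param_knn__n_neighbors', 'mean_train_score', 'mean_test_score']])
print(clf.best_params_, clf.best_score_)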
A quick tip: the square root of the number of samples is usually a good choice for n_neighbors, so make sure you include it in your GridSearchCV. Hope this is helpful.
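As a rough sketch of that tip (keeping k odd is just the usual convention for avoiding ties):

import numpy as np

k = int(np.sqrt(len(X)))
k = k if k % 2 else k + 1  # nudge to an odd number
params = {'knn__n_neighbors': sorted({3, 5, 7, 9, 11, k})}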