
Grid Search parameter and cross-validated data set in KNN classifier in Scikit-learn

I'm trying to build my first KNN classifier using scikit-learn. I've been following the User Guide and other online examples, but there are a few things I am unsure about. For this post, let's use the following:

X = data
Y = target

1) Most introductions to machine learning that I've read say you want a training set, a validation set, and a test set. From what I understand, cross-validation lets you combine the training and validation sets to tune the model, and then you should test it on the test set to get a score. However, I have seen in papers that in a lot of cases you can just cross-validate on the entire data set and report the CV score as the accuracy. I understand that in an ideal world you would want to test on separate data, but if this is legitimate I would like to cross-validate on my entire dataset and report those scores.
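For concreteness, this is the "cross-validate on everything" version I have in mind (load_iris is just a stand-in here, since X and Y above are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, Y = load_iris(return_X_y=True)  # stand-in for data/target

    # One accuracy score per fold; papers typically report the mean
    scores = cross_val_score(KNeighborsClassifier(), X, Y, cv=5)
    print(scores.mean())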

2) So, starting the process:

I define my KNN classifier as follows:

knn = KNeighborsClassifier(algorithm='brute')

and search for the best n_neighbors using:

clf = GridSearchCV(knn, parameters, cv=5)

Now if I say

clf.fit(X,Y)

I can check the best parameter using

clf.best_params_

and then I can get a score

clf.score(X,Y)

But, as I understand it, this hasn't cross-validated the model, as it only gives one score?

If clf.best_params_ comes back as n_neighbors = 14, could I now go on:

knn2 = KNeighborsClassifier(n_neighbors=14, algorithm='brute')
cross_val_score(knn2, X, Y, cv=5)

Now I know the data has been cross-validated, but I don't know if it is legitimate to use clf.fit to find the best parameter and then use cross_val_score with a new KNN model?
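In full, the two-step flow I mean would look something like this (again with load_iris standing in for my data, and a made-up parameter grid):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, Y = load_iris(return_X_y=True)                 # stand-in for data/target
    parameters = {'n_neighbors': list(range(1, 31))}  # made-up grid

    clf = GridSearchCV(KNeighborsClassifier(algorithm='brute'), parameters, cv=5)
    clf.fit(X, Y)

    # Re-run plain CV with the winning parameter on the same data
    knn2 = KNeighborsClassifier(algorithm='brute', **clf.best_params_)
    print(cross_val_score(knn2, X, Y, cv=5))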

3) I understand that the 'proper' way to do it would be as follows:

Split into X_train, X_test, Y_train, Y_test; scale the train set, then apply the same transform to the test set.
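A minimal sketch of that split-and-scale step (the test_size and random_state values are arbitrary):

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=0)

    scaler = StandardScaler().fit(X_train)  # fit on the train set only
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)       # reuse the train statistics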

knn = KNeighborsClassifier(algorithm='brute')
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_train, Y_train)
clf.best_params_

and then I can get a score

clf.score(X_test, Y_test)

In this case, is the score calculated using the best parameter?


I hope that this makes sense. I've been trying to find as much as I can without posting, but I have come to the point where I think it would be easier to get some direct answers.

In my head, I am trying to get some cross-validated scores using the whole dataset, but also use a grid search (or something similar) to fine-tune the parameters.

Thanks in advance

asked Nov 16 '16 by browser

2 Answers

  1. Yes, you can CV on your entire dataset, and it is viable, but I still suggest you at least split your data into two sets: one for CV and one for testing (see the sketch after this list).

  2. According to the documentation, the .score function returns a single float value: the score of the best estimator (the best-scoring estimator you get from fitting your GridSearchCV) on the given X, Y.

  3. If you saw that the best parameter is 14, then yes, you can go on with using it in your model, but if you gave it more parameters you should set all of them. (I say that because you haven't given your parameters list.) And yes, it is legitimate to check your CV once again, just in case this model is as good as it should be.
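
A rough sketch of that split-then-CV workflow (the parameter grid is a placeholder, since the original post doesn't include one, and load_iris stands in for the data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, Y = load_iris(return_X_y=True)  # stand-in for data/target

    # Hold out a test set; cross-validate on the rest
    X_cv, X_test, Y_cv, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    parameters = {'n_neighbors': list(range(1, 31))}  # placeholder grid
    clf = GridSearchCV(KNeighborsClassifier(algorithm='brute'), parameters, cv=5)
    clf.fit(X_cv, Y_cv)

    print(clf.best_params_)           # e.g. {'n_neighbors': 14}
    print(clf.score(X_test, Y_test))  # score of the refit best estimator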

Hope that makes things clearer :)

answered Sep 29 '22 by nitheism


If the dataset is small, you may not have the luxury of a train/test split. People often estimate the predictive power of the model solely based on cross-validation. In your code above, GridSearchCV performs 5-fold cross-validation when you fit the model (clf.fit(X, y)) by splitting your train set into an inner train set (80%) and a validation set (20%).

You can access the model performance metrics, including the validation scores, through clf.cv_results_. The metric to look at is mean_test_score (in your case, you should have one score for each n_neighbors value). You may also want to turn on return_train_score so that 'mean_train_score' is reported too, which gives a sense of whether the model is overfitting. See the sample code below for the model setup (note that KNN is a non-parametric ML model that is sensitive to the scale of the features, so people often normalize features using StandardScaler):

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([
        ('sc', StandardScaler()),                         # scale features first
        ('knn', KNeighborsClassifier(algorithm='brute'))  # then fit KNN
    ])
    params = {
        'knn__n_neighbors': [3, 5, 7, 9, 11]  # usually odd numbers
    }
    clf = GridSearchCV(estimator=pipe,
                       param_grid=params,
                       cv=5,
                       return_train_score=True)  # turn on CV train scores
    clf.fit(X, y)
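
Once fitted, you can inspect the per-parameter scores like this (pandas is just for readability here):

    import pandas as pd

    results = pd.DataFrame(clf.cv_results_)
    print(results[['param_knn__n_neighbors', 'mean_train_score', 'mean_test_score']])
    print(clf.best_params_)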

A quick tip: the square root of the number of samples is usually a good choice for n_neighbors, so make sure you include that in your GridSearchCV (see the snippet below). Hope this is helpful.
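For example, building on the params dict above (X is the feature matrix; keeping k odd is just to avoid ties):

    import numpy as np

    k = int(np.sqrt(X.shape[0]))  # square-root-of-n heuristic
    if k % 2 == 0:
        k += 1                    # keep k odd to avoid ties
    params['knn__n_neighbors'].append(k)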

answered Sep 29 '22 by Kai Zhao