
GridSearchCV.best_score_ meaning when scoring set to 'accuracy' and CV

I'm trying to find the best Neural Network model for the classification of breast cancer samples on the well-known Wisconsin Cancer dataset (569 samples, 31 features + target). I'm using sklearn 0.18.1. I'm not using normalization so far; I'll add it once I solve this question.

# some init code omitted
X_train, X_test, y_train, y_test = train_test_split(X, y)

Define the NN parameter grids for GridSearchCV

tuned_params = [{'solver': ['sgd'], 'learning_rate': ['constant'], "learning_rate_init" : [0.001, 0.01, 0.05, 0.1]},
                {"learning_rate_init" : [0.001, 0.01, 0.05, 0.1]}]

CV method and model

cv_method = KFold(n_splits=4, shuffle=True)
model = MLPClassifier()

Apply grid

grid = GridSearchCV(estimator=model, param_grid=tuned_params, cv=cv_method, scoring='accuracy')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)

And if I run:

print(grid.best_score_)
print(accuracy_score(y_test, y_pred))

The results are 0.746478873239 and 0.902097902098 respectively.

According to the doc, "best_score_ : float, Score of best_estimator on the left out data". I assume it is the best accuracy among the ones obtained by running the 8 different configurations specified in tuned_params, each evaluated the number of times specified by KFold, on the left-out data as determined by KFold. Am I right?

One more question. Is there a method to find the optimal size of the test set for train_test_split, which defaults to 0.25?

Thanks a lot

REFERENCES

  • http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
  • http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
  • http://scikit-learn.org/stable/modules/grid_search.html
  • http://scikit-learn.org/stable/modules/cross_validation.html
  • http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py
Asked by Taka, Jun 09 '17


1 Answer

The grid.best_score_ is the average of the test scores over all CV folds for the single best-performing combination of the parameters you specify in tuned_params. It is not the accuracy on your held-out X_test, which is why the two numbers you printed differ.
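A minimal, self-contained sketch of that relationship (not from the original post: it swaps in LogisticRegression for MLPClassifier purely for speed, and fixes random seeds; the attributes best_score_, best_index_ and cv_results_ are the real GridSearchCV API):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(solver='liblinear'),  # fast stand-in for MLPClassifier
    param_grid={'C': [0.1, 1.0]},
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
    scoring='accuracy',
)
grid.fit(X_train, y_train)

# Collect the 4 per-fold test scores of the best parameter combination
res = grid.cv_results_
i = grid.best_index_
fold_scores = [res['split%d_test_score' % k][i] for k in range(4)]

# best_score_ is the mean of the fold scores on the *training* data's CV splits,
# not the accuracy on the held-out X_test
assert np.isclose(grid.best_score_, np.mean(fold_scores))
```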

In order to access other relevant details about the grid searching process, you can look at the grid.cv_results_ attribute.

From the documentation of GridSearchCV:

cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, 
that can be imported into a pandas DataFrame

It contains keys like 'split0_test_score', 'split1_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'mean_train_score', etc., which give additional information about the whole execution.
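As a quick sketch of that pandas import (the estimator and parameter grid here are illustrative stand-ins, not the poster's MLPClassifier setup; the cv_results_ keys themselves are the real API):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4]},  # one row per combination in cv_results_
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
    scoring='accuracy',
)
grid.fit(X, y)

# cv_results_ maps column headers to columns, so it loads straight into a DataFrame
df = pd.DataFrame(grid.cv_results_)
cols = ['params', 'split0_test_score', 'split1_test_score',
        'mean_test_score', 'std_test_score', 'rank_test_score']
print(df[cols].sort_values('rank_test_score'))
```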

Answered by Vivek Kumar, Sep 22 '22