I am doing hyperparameter tuning with GridSearchCV for Decision Trees. I have fit the model and I am trying to find out what exactly GridSearchCV.cv_results_ gives. I have read the documentation, but it is still not clear. Could anyone explain this attribute?
My code is below:
depth={"max_depth":[1,5,10,50,100,500,1000],
"min_samples_split":[5,10,100,500]}
DTC=DecisionTreeClassifier(class_weight="balanced")
DTC_Grid=GridSearchCV(DTC,param_grid=depth , cv=3, scoring='roc_auc')
DTC_Bow=DTC_Grid.fit(xtrain_bow,ytrain_bow)
Grid search is used to find the combination of hyperparameters that gives a model its most 'accurate' predictions.
Note that the mean_test_score sklearn returns is the mean computed over all samples, with every sample weighted equally. If you instead take the mean of the per-fold (per-split) scores, you only get the same result when all folds have exactly the same size.
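To make that distinction concrete, here is a small numeric sketch (the fold scores and fold sizes are made up purely for illustration):

import numpy as np

# hypothetical AUC per fold and test-fold sizes for a 3-fold split
fold_scores = np.array([0.9, 0.7, 0.8])
fold_sizes = np.array([34, 33, 33])                 # folds of unequal size

print(fold_scores.mean())                           # ~0.800, plain mean of the folds
print(np.average(fold_scores, weights=fold_sizes))  # ~0.801, mean weighted by fold size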
GridSearch is a tool used for hyperparameter tuning. As stated before, machine learning in practice comes down to comparing different models to each other and trying to find the one that works best.
To recap grid search: Advantages: it is an exhaustive search, so it will find the best way to tune the hyperparameters for the training set. Disadvantages: it is time-consuming and carries a danger of overfitting.
DTC_Bow.cv_results_ returns a dictionary of all the evaluation metrics from the grid search. To visualize it properly, you can load it into a pandas DataFrame:
pd.DataFrame(DTC_Bow.cv_results_)
In your case, this should return a dataframe with 28 rows (7 choices for max_depth times 4 choices for min_samples_split). Each row of this dataframe gives the grid-search metrics for one combination of these two parameters. Remember, the goal of a grid search is to select which combination of parameters has the best performance metrics. This is the purpose of cv_results_.
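For example, a quick sanity check (a sketch building on the code above; DTC_Bow is the fitted grid search from the question):

import pandas as pd

results = pd.DataFrame(DTC_Bow.cv_results_)
print(results.shape)    # (28, n_columns): 7 values of max_depth x 4 values of min_samples_split
print(results.columns)  # the params, score, rank and timing columns described below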
You should have one column called param_max_depth and another called param_min_samples_split referencing the value of each parameter for each row. The combination of the two is summarized as a dictionary in the column params.
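For instance, to look only at the parameter columns (continuing with the results DataFrame built above):

# each row identifies one parameter combination
print(results[["param_max_depth", "param_min_samples_split", "params"]].head())
# params holds the same information as a dictionary, e.g. {'max_depth': 1, 'min_samples_split': 5}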
Now to the metrics. The default value of return_train_score was True up until now, but it changes to False in version 0.21. If you want the train metrics, set it to True. Usually, though, what you are interested in are the test metrics.
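If you do want the train metrics, pass the flag explicitly when constructing the grid search, for example:

# request train-set metrics as well (adds mean_train_score, std_train_score,
# and split0_train_score ... columns to cv_results_)
DTC_Grid = GridSearchCV(DTC, param_grid=depth, cv=3, scoring='roc_auc',
                        return_train_score=True)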
The main column is mean_test_score. This is the average of the columns split0_test_score, split1_test_score and split2_test_score (because you are doing a 3-fold split in your grid search). If you call DTC_Bow.best_score_, it returns the maximum value of the column mean_test_score. The column rank_test_score ranks all parameter combinations by their values of mean_test_score.
You might also want to look at std_test_score, which is the standard deviation of split0_test_score, split1_test_score and split2_test_score. This is of interest if you want to see how consistently your set of parameters performs on the hold-out data.
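A typical way to use these columns (a sketch, again using the results DataFrame from above):

# sort the combinations from best to worst mean AUC
cols = ["param_max_depth", "param_min_samples_split",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())

# best_score_ is simply the maximum of mean_test_score
assert DTC_Bow.best_score_ == results["mean_test_score"].max()
print(DTC_Bow.best_params_)  # the parameter combination that achieved it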
As mentioned, you can have the metrics on the train set as well, provided you set return_train_score = True.
Finally, there are also time columns that tell you how long each row took: how much time it took to train the model (mean_fit_time, std_fit_time) and to evaluate it (mean_score_time, std_score_time). This is just an FYI; usually, unless time is a bottleneck, you would not really look at these metrics.
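If you ever do need them, they are read the same way as the score columns, for example:

# average and spread of the training / scoring time (in seconds) per parameter combination
print(results[["mean_fit_time", "std_fit_time", "mean_score_time", "std_score_time"]].describe())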