 

What is GridSearchCV.cv_results_? Could anyone explain all the things in it, i.e. mean_test_score etc.?

I am doing hyperparameter tuning with GridSearchCV for decision trees. I have fit the model and I am trying to find out what exactly GridSearchCV.cv_results_ gives. I have read the documentation, but it is still not clear to me. Could anyone explain this attribute?

My code is below:

depth={"max_depth":[1,5,10,50,100,500,1000],
       "min_samples_split":[5,10,100,500]}       

DTC=DecisionTreeClassifier(class_weight="balanced")

DTC_Grid=GridSearchCV(DTC,param_grid=depth , cv=3, scoring='roc_auc')
DTC_Bow=DTC_Grid.fit(xtrain_bow,ytrain_bow) 
asked Feb 09 '19 by Vishal Suryavanshi

People also ask

What is GridSearch used for?

Grid search is used to find the optimal hyperparameters of a model, which result in the most 'accurate' predictions.

What is mean_test_score?

The mean_test_score that sklearn returns is the mean calculated on all samples where each sample has the same weight. If you calculate the mean by taking the mean of the folds (splits), then you only get the same results if the folds are all of equal size.
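As a toy illustration of that distinction (the fold sizes and scores below are made up for this example), the fold-wise mean and the sample-weighted mean only coincide when the folds have equal sizes:

import numpy as np

# Hypothetical per-fold test scores and fold sizes, for illustration only
fold_scores = np.array([0.8, 0.9, 0.7])
fold_sizes = np.array([100, 100, 50])

print(fold_scores.mean())                           # 0.8  (unweighted mean over folds)
print(np.average(fold_scores, weights=fold_sizes))  # 0.82 (mean weighted by fold size)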

What is GridSearch in machine learning?

Grid search is a tool used for hyperparameter tuning. As stated before, machine learning in practice comes down to comparing different models to each other and trying to find the best-performing one.

What are the advantages and disadvantages of the GridSearch method?

To recap grid search. Advantages: it performs an exhaustive search and will find the best way to tune the hyperparameters based on the training set. Disadvantages: it is time-consuming, and there is a danger of overfitting.


1 Answer

DTC_Bow.cv_results_ is a dictionary containing all the evaluation metrics from the grid search. To visualize it properly, you can do:

import pandas as pd

pd.DataFrame(DTC_Bow.cv_results_)

In your case, this should return a dataframe with 28 rows (7 choices for max_depth times 4 choices for min_samples_split). Each row of this dataframe gives the gridsearch metrics for one combination of these two parameters. Remember the goal of a gridsearch is to select which combination of parameters will have the best performance metrics. This is the purpose of cv_results_.
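For example (assuming the dataframe is stored in a variable called results, a name chosen here purely for illustration), you can confirm that the number of rows matches the size of the parameter grid:

results = pd.DataFrame(DTC_Bow.cv_results_)

print(results.shape[0])                                            # 28 rows
print(len(depth["max_depth"]) * len(depth["min_samples_split"]))   # 7 * 4 = 28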

You should have one column called param_max_depth and another called param_min_samples_split referencing the value of the parameter for each row. The combination of the two is summarized as a dictionary in the column params. The sketch below shows how to pull these columns out.
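Continuing with the hypothetical results dataframe from above, you can select just these columns to see which parameter combination each row corresponds to:

results[["param_max_depth", "param_min_samples_split", "params"]]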

Now to the metrics. The default value of return_train_score was True up until now, but it will change to False in version 0.21. If you want the train metrics, set it to True explicitly. But usually, what you are interested in are the test metrics.
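If you do want the train scores, a minimal sketch would be to pass the flag explicitly when building the grid search:

DTC_Grid = GridSearchCV(DTC, param_grid=depth, cv=3, scoring='roc_auc',
                        return_train_score=True)
# After fitting, cv_results_ then also contains mean_train_score, std_train_score
# and the per-split train score columns.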

The main column is mean_test_score. This is the average of the columns split0_test_score, split1_test_score and split2_test_score (because you are doing a 3-fold split in your grid search). If you call DTC_Bow.best_score_, this will return the maximum value of the column mean_test_score. The column rank_test_score ranks all parameter combinations by their values of mean_test_score.
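Assuming the same hypothetical results dataframe as above, you can verify these relationships yourself:

# best_score_ is the mean_test_score of the best parameter combination
print(DTC_Bow.best_score_ == results["mean_test_score"].max())

# rank_test_score orders the parameter combinations from best (rank 1) to worst
results.sort_values("rank_test_score").head()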

You might also want to look at std_test_score, which is the standard deviation of split0_test_score, split1_test_score and split2_test_score. This might be of interest if you want to see how consistently your set of parameters performs on your hold-out data.
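For instance, std_test_score should simply be the (population) standard deviation of the per-split columns, which you can check on the hypothetical results dataframe:

import numpy as np

split_cols = ["split0_test_score", "split1_test_score", "split2_test_score"]
print(np.allclose(results[split_cols].std(axis=1, ddof=0), results["std_test_score"]))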

As mentioned, you can have the metrics on the train set as well provided you set return_train_score = True.

Finally, there are also time columns that tell you how long each row took: how much time it took to train the model (mean_fit_time, std_fit_time) and to score it (mean_score_time, std_score_time). This is just an FYI; usually, unless time is a bottleneck, you would not really look at these metrics.
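If you ever do want to inspect them, they are just more columns of the same hypothetical results dataframe:

results[["mean_fit_time", "std_fit_time", "mean_score_time", "std_score_time"]]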

answered Jan 04 '23 by MaximeKan