I am doing hyperparameter tuning with GridSearchCV for Decision Trees. I have fit the model and I am trying to find out what exactly GridSearchCV.cv_results_ gives. I have read the documentation, but it is still not clear. Could anyone explain this attribute?
My code is below:
depth={"max_depth":[1,5,10,50,100,500,1000],
"min_samples_split":[5,10,100,500]}
DTC=DecisionTreeClassifier(class_weight="balanced")
DTC_Grid=GridSearchCV(DTC,param_grid=depth , cv=3, scoring='roc_auc')
DTC_Bow=DTC_Grid.fit(xtrain_bow,ytrain_bow)
Grid search is used to find the combination of hyperparameters that gives a model its most 'accurate' predictions.
Note that the mean_test_score sklearn returns is the mean computed over all samples, with every sample weighted equally. If you instead take the mean of the per-fold (per-split) scores, you only get the same result when all folds have exactly the same size.
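To make that distinction concrete, here is a small numeric sketch (the fold scores and fold sizes are made up purely for illustration):

import numpy as np

# hypothetical AUC per fold and test-fold sizes for a 3-fold split
fold_scores = np.array([0.9, 0.7, 0.8])
fold_sizes = np.array([34, 33, 33])                 # folds of unequal size

print(fold_scores.mean())                           # ~0.800, plain mean of the folds
print(np.average(fold_scores, weights=fold_sizes))  # ~0.801, mean weighted by fold size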
GridSearch is a tool used for hyperparameter tuning. As stated before, machine learning in practice comes down to comparing different models to each other and trying to find the one that works best.
To recap grid search: Advantages: it is an exhaustive search, so it will find the best way to tune the hyperparameters for the training set. Disadvantages: it is time-consuming and carries a danger of overfitting.
DTC_Bow.cv_results_ returns a dictionary of all the evaluation metrics from the grid search. To visualize it properly, you can load it into a pandas DataFrame:
pd.DataFrame(DTC_Bow.cv_results_)
In your case, this should return a dataframe with 28 rows (7 choices for max_depth times 4 choices for min_samples_split). Each row of this dataframe gives the grid-search metrics for one combination of these two parameters. Remember, the goal of a grid search is to select which combination of parameters has the best performance metrics. This is the purpose of cv_results_.
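For example, a quick sanity check (a sketch building on the code above; DTC_Bow is the fitted grid search from the question):

import pandas as pd

results = pd.DataFrame(DTC_Bow.cv_results_)
print(results.shape)    # (28, n_columns): 7 values of max_depth x 4 values of min_samples_split
print(results.columns)  # the params, score, rank and timing columns described below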
You should have one column called param_max_depth and another called param_min_samples_split referencing the value of each parameter for each row. The combination of the two is summarized as a dictionary in the column params.
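For instance, to look only at the parameter columns (continuing with the results DataFrame built above):

# each row identifies one parameter combination
print(results[["param_max_depth", "param_min_samples_split", "params"]].head())
# params holds the same information as a dictionary, e.g. {'max_depth': 1, 'min_samples_split': 5}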
Now to the metrics. The default value of return_train_score was True up until now, but it changes to False in version 0.21. If you want the train metrics, set it to True. Usually, though, what you are interested in are the test metrics.
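If you do want the train metrics, pass the flag explicitly when constructing the grid search, for example:

# request train-set metrics as well (adds mean_train_score, std_train_score,
# and split0_train_score ... columns to cv_results_)
DTC_Grid = GridSearchCV(DTC, param_grid=depth, cv=3, scoring='roc_auc',
                        return_train_score=True)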
The main column is mean_test_score. This is the average of the columns split0_test_score, split1_test_score and split2_test_score (because you are doing a 3-fold split in your grid search). If you call DTC_Bow.best_score_, it returns the maximum value of the column mean_test_score. The column rank_test_score ranks all parameter combinations by their values of mean_test_score.
You might also want to look at std_test_score, which is the standard deviation of split0_test_score, split1_test_score and split2_test_score. This is of interest if you want to see how consistently your set of parameters performs on the hold-out data.
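A typical way to use these columns (a sketch, again using the results DataFrame from above):

# sort the combinations from best to worst mean AUC
cols = ["param_max_depth", "param_min_samples_split",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())

# best_score_ is simply the maximum of mean_test_score
assert DTC_Bow.best_score_ == results["mean_test_score"].max()
print(DTC_Bow.best_params_)  # the parameter combination that achieved it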
As mentioned, you can have the metrics on the train set as well, provided you set return_train_score = True.
Finally, there are also time columns that tell you how long each row took: how much time it took to train the model (mean_fit_time, std_fit_time) and to evaluate it (mean_score_time, std_score_time). This is just an FYI; usually, unless time is a bottleneck, you would not really look at these metrics.
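If you ever do need them, they are read the same way as the score columns, for example:

# average and spread of the training / scoring time (in seconds) per parameter combination
print(results[["mean_fit_time", "std_fit_time", "mean_score_time", "std_score_time"]].describe())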