
Why does sklearn.grid_search.GridSearchCV return random results on every execution?

I was trying to find the optimal parameters for a decision tree classifier on the Iris dataset using sklearn.grid_search.GridSearchCV. I used StratifiedKFold (sklearn.cross_validation.StratifiedKFold) for cross-validation, since my classes were imbalanced. But on every execution of GridSearchCV, it returned a different set of parameters.
Shouldn't it return the same set of optimal parameters, given that the data and the cross-validation splits were the same every single time?

Source code follows:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# Iris dataset: feature matrix and class labels
iris = load_iris()
all_inputs, all_classes = iris.data, iris.target

decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

# 10-fold stratified cross-validation (shuffle=False by default)
cross_validation = StratifiedKFold(all_classes, n_folds=10)

grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(all_inputs, all_classes)

print "Best Score: {}".format(grid_search.best_score_)
print "Best params: {}".format(grid_search.best_params_)

Output from four separate runs:

Best Score: 0.959731543624
Best params: {'max_features': 2, 'max_depth': 2}

Best Score: 0.973154362416
Best params: {'max_features': 3, 'max_depth': 5}

Best Score: 0.973154362416
Best params: {'max_features': 2, 'max_depth': 5}

Best Score: 0.959731543624
Best params: {'max_features': 3, 'max_depth': 3}

This is an excerpt from an IPython notebook which I made recently, with reference to Randal S. Olson's notebook, which can be found here.

Edit: It is not the random_state parameter of StratifiedKFold that causes the varying results, but rather the random_state parameter of DecisionTreeClassifier, which controls the randomness used while building the tree (for example, which of the max_features candidate features are considered at each split) and so gives different results on each run (see the documentation). As for StratifiedKFold, as long as its shuffle parameter is left at False (the default), it generates the same training/test splits every time (see the documentation).
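For reference, a minimal sketch of the fix under that reading, reusing the names from the snippet above; the seed value 42 is arbitrary:

# Pin the classifier's seed; the folds are already deterministic because
# shuffle=False is the StratifiedKFold default.
decision_tree_classifier = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid,
                           cv=StratifiedKFold(all_classes, n_folds=10))
grid_search.fit(all_inputs, all_classes)

print "Best params: {}".format(grid_search.best_params_)  # same on every run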

asked by darthy

2 Answers

The training results depend on the way the training data is split during cross-validation. Each time you run, the data is split randomly, and hence you observe minor differences in your results. You should use the random_state parameter of StratifiedKFold to make sure that the training data is split exactly the same way each time.

See my other answer for more about random_state:

  • Classification results depend on random_state?
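A minimal sketch of that suggestion, reusing the variable names from the question (in this StratifiedKFold API, random_state only takes effect together with shuffle=True):

# Fix the fold generation itself so the same splits are produced on each run.
cross_validation = StratifiedKFold(all_classes, n_folds=10,
                                   shuffle=True, random_state=0)

grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)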
answered by Vivek Kumar


On each run, the cross-validation randomly splits the training and validation sets, therefore the results differ between runs.
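A quick check of this claim, as a toy sketch; note that it only holds when shuffle=True is passed, since with the default shuffle=False (as in the question) the folds are identical across runs:

import numpy as np
from sklearn.cross_validation import StratifiedKFold

# Toy labels: five samples of each class.
y = np.array([0] * 5 + [1] * 5)

# Two CV objects built with shuffle=True and no fixed random_state will
# usually yield different test folds.
folds_a = [test for _, test in StratifiedKFold(y, n_folds=2, shuffle=True)]
folds_b = [test for _, test in StratifiedKFold(y, n_folds=2, shuffle=True)]
print [np.array_equal(a, b) for a, b in zip(folds_a, folds_b)]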

answered by BoscoTsang