I was trying to find the optimal parameters for a decision tree classifier on the Iris dataset using sklearn.grid_search.GridSearchCV. I used StratifiedKFold (sklearn.cross_validation.StratifiedKFold) for cross-validation, since my data was imbalanced. But on every execution, GridSearchCV returned a different set of parameters.
Shouldn't it return the same set of optimal parameters, given that the data and the cross-validation were the same every single time?
Source code follows:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# Load the Iris dataset
iris = load_iris()
all_inputs = iris.data
all_classes = iris.target

decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
# Stratified folds preserve the class proportions in each split
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print "Best Score: {}".format(grid_search.best_score_)
print "Best params: {}".format(grid_search.best_params_)
Outputs from four consecutive runs:
Best Score: 0.959731543624
Best params: {'max_features': 2, 'max_depth': 2}

Best Score: 0.973154362416
Best params: {'max_features': 3, 'max_depth': 5}

Best Score: 0.973154362416
Best params: {'max_features': 2, 'max_depth': 5}

Best Score: 0.959731543624
Best params: {'max_features': 3, 'max_depth': 3}
This is an excerpt from an IPython notebook I made recently, based on Randal S. Olson's notebook, which can be found here.
Edit:
It's not the random_state parameter of StratifiedKFold that causes the varied results, but rather the random_state parameter of DecisionTreeClassifier: because max_features is less than the number of features, the classifier randomly samples candidate features at each split, so each run can build a different tree (refer to the documentation). As for StratifiedKFold, as long as the shuffle parameter is set to False (the default), it generates the same training-test splits on every run (refer to the documentation).
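A minimal sketch of that fix, reusing all_inputs and all_classes from the question above: pinning random_state on the classifier removes the only remaining source of randomness, so every run should report the same best parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# random_state pins the random feature sampling done at each split
decision_tree_classifier = DecisionTreeClassifier(random_state=0)
# shuffle defaults to False, so the folds are already identical across runs
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid={'max_depth': [1, 2, 3, 4, 5],
                                       'max_features': [1, 2, 3, 4]},
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print "Best params: {}".format(grid_search.best_params_)  # stable across runs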
The training results depend on how the training data is split during cross-validation. Each time you run, the data is split randomly, and hence you observe minor differences in your answer.
You should use the random_state parameter of StratifiedKFold (together with shuffle=True, since random_state has no effect otherwise) to make sure the training data is split exactly the same way each time.
See my other answer to learn more about random_state:
On each run, the CV randomly splits the training and validation sets, so the results of each run will differ.
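As a short sketch of that suggestion (assuming the same all_classes array as in the question): in the old sklearn.cross_validation API, random_state only takes effect when shuffle=True; set together, they make the shuffled folds reproducible.
from sklearn.cross_validation import StratifiedKFold

# Shuffle before splitting, but seed the shuffle so every execution
# yields identical stratified folds
cross_validation = StratifiedKFold(all_classes, n_folds=10,
                                   shuffle=True, random_state=42)
for train_index, test_index in cross_validation:
    # same index arrays on every run
    print len(train_index), len(test_index)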