This is my first question here, so I hope I am doing it right.
I was working on the Titanic dataset, which is popular on Kaggle, following this tutorial if you want to check it: A Data Science Framework: To Achieve 99% Accuracy.
Part 5.2 teaches how to grid search and tune hyperparameters. Let me share the related code before I get specific about my question.
This is tuning the model with GridSearchCV:
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
#cv_split = model_selection.KFold(n_splits=10, shuffle=False, random_state=None)
param_grid = {'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
'max_depth': [2,4,6,8,10,None], #max depth tree can grow; default is none
'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split (fraction is % of total); default is 1
'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
'random_state': [0] }
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', return_train_score = True ,cv = cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target])
tune_model.best_params_
The result is:
{'criterion': 'gini',
'max_depth': 4,
'max_features': None,
'min_samples_leaf': 5,
'min_samples_split': 2,
'random_state': 0,
'splitter': 'best'}
And according to the code, the training and test scores are supposed to look like this when tuned with those parameters:
print(tune_model.cv_results_['mean_train_score'][tune_model.best_index_], tune_model.cv_results_['mean_test_score'][tune_model.best_index_])
output of this: 0.8924916598172832 0.8767742588186237
Out of curiosity, I wanted to build my own DecisionTreeClassifier() with the parameters I got from GridSearchCV:
dtree = tree.DecisionTreeClassifier(criterion = 'gini',max_depth = 4,max_features= None, min_samples_leaf= 5, min_samples_split= 2,random_state = 0, splitter ='best')
results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target],return_train_score = True, cv = cv_split)
Same hyperparameters, same cross-validation splits, different results. Why?
print(results['train_score'].mean(), results['test_score'].mean())
0.8387640449438202 0.8227611940298509
These were the tune_model results:
0.8924916598172832 0.8767742588186237
The difference is not even small. Both results should be the same, if you ask me.
I don't understand what is different, and why the results are different.
I tried cross-validating with KFold instead of ShuffleSplit,
and in both scenarios I tried different random_state values, and also random_state = None,
but I still get different results.
Can someone explain the difference, please?
Edit: by the way, I also wanted to check the results on the test sample:
dtree.fit(data1[data1_x_bin],data1[Target])
dtree.score(test1_x_bin,test1_y), tune_model.score(test1_x_bin,test1_y)
output: (0.8295964125560538, 0.9033059266872216)
Same model (DecisionTreeClassifier), same hyperparameters, very different results.
(Obviously they are not the same models, but I can't see how and why.)
By default no shuffling occurs, including for the (stratified) K-fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
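A minimal sketch with invented toy data (the array and seed values are made up purely for illustration) showing that default behaviour:
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(10, 2)  # toy feature matrix, 10 samples

# cv=some_integer inside cross_val_score/GridSearchCV uses an unshuffled
# (Stratified)KFold, so the folds are contiguous and identical on every run.
kf = KFold(n_splits=5)  # shuffle=False is the default
print([test.tolist() for _, test in kf.split(X)])

# ShuffleSplit draws random splits; fixing random_state makes them reproducible.
ss = ShuffleSplit(n_splits=5, test_size=.3, random_state=0)
print([test.tolist() for _, test in ss.split(X)])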
Update
By default, cross_validate uses the estimator's score method to evaluate its performance (you can change that by specifying the scoring keyword argument of cross_validate). The score method of the DecisionTreeClassifier class uses accuracy as its score metric. Within the GridSearchCV, roc_auc is used as the score metric. Using the same score metric in both cases results in identical scores. E.g. if the score metric of cross_validate is changed to roc_auc, the score difference you observed between the models vanishes.
results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target], scoring = 'roc_auc' ... )
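For completeness, here is a minimal sketch of the full call, assuming the data1, data1_x_bin, Target, cv_split, dtree and tune_model objects from the question are already defined:
results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target],
                                          scoring = 'roc_auc',        # same metric GridSearchCV optimized
                                          return_train_score = True,
                                          cv = cv_split)              # same splitter as in the grid search

print(results['train_score'].mean(), results['test_score'].mean())
# should now match the GridSearchCV numbers:
print(tune_model.cv_results_['mean_train_score'][tune_model.best_index_],
      tune_model.cv_results_['mean_test_score'][tune_model.best_index_])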
Regarding score metrics:
The choice of the score metric determines how the performance of a model is evaluated.
Imagine a model should predict whether a traffic light is green (traffic light is green -> 1, traffic light is not green -> 0). This model can make two types of mistakes. Either it says the traffic light is green although it is not green (false positive), or it says the traffic light is not green although it is green (false negative). In this case, a false negative would be ugly, but bearable in its consequences (somebody has to wait longer at the traffic light than necessary). False positives, on the other hand, would be catastrophic (someone passes the red traffic light because it has been classified as green). In order to evaluate the model's performance, a score metric would be chosen which weighs false positives higher (i.e. classifies them as "worse" errors) than false negatives. Accuracy would be an unsuitable metric here, because false negatives and false positives lower the score to the same extent. More suitable as a score metric would be, for example, precision. This metric weighs false positives with 1 and false negatives with 0 (the number of false negatives has no influence on the precision of a model).

For a good overview of what false negatives, false positives, precision, recall, accuracy, etc. are, see here. The beta parameter of the F score (another score metric) can be used to set how false positives should be weighted compared to false negatives (for a more detailed explanation, see here). More information about the roc_auc score can be found here (it is calculated from different statistics of the confusion matrix).
In summary, this means that the same model can perform very well in relation to one score metric, while performing poorly in relation to another. In the case you described, the decision tree optimized by GridSearchCV and the tree you instantiated afterwards are identical models. Both yield identical accuracies or identical roc_auc scores. Which score metric you use to compare the performance of different models on your data set depends on which criteria you consider to be particularly important for model performance. If the only criterion is how many instances have been classified correctly, accuracy is probably a good choice.
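As a small, invented example (not taken from the Titanic data), the same set of predictions can look quite different depending on the metric used to score it:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]                  # made-up ground truth
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]                  # one false negative, one false positive
y_prob = [.9, .8, .4, .6, .3, .2, .1, .1]          # made-up predicted probabilities of class 1

print(accuracy_score(y_true, y_pred))              # 0.75  - counts FP and FN the same
print(precision_score(y_true, y_pred))             # 0.67  - punished by the false positive
print(recall_score(y_true, y_pred))                # 0.67  - punished by the false negative
print(roc_auc_score(y_true, y_prob))               # ~0.93 - ranks probabilities instead of hard labels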
Old Idea (see comments):
You specified a random state for dtree (dtree = tree.DecisionTreeClassifier(random_state = 0, ...)), but none for the decision tree used in the GridSearchCV. Use the same random state there and let me know if that solved the problem.
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(random_state=0), ...)