I am using a Random Forest with scikit-learn. The RF overfits the data and the prediction results are poor.
The overfitting does NOT depend on the parameters of the RF: NBtree, Depth_Tree (i.e. the number of trees and the tree depth).
Overfitting happens across many different parameter settings (I tested this with a grid search).
To remedy this, I tweak the initial data / down-sample some of it in order to affect the fitting (manually pre-processing out noisy samples). Roughly, I then do the following (see the sketch below):
Loop over random generations of RF fits,
get the RF predictions on the data to be predicted,
and select the model that best fits the "predicted data" (not the calibration data).
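A simplified sketch of that loop, where X_train/y_train are the calibration data and X_pred/y_pred are just placeholders for the separate prediction data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

best_score, best_model = -np.inf, None
for seed in range(100):
    # manually re-sample / down-sample the calibration data
    X_s, y_s = resample(X_train, y_train, n_samples=len(X_train) // 2,
                        random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_s, y_s)
    score = model.score(X_pred, y_pred)  # judge the fit on the prediction data
    if score > best_score:
        best_score, best_model = score, model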
This Monte Carlo approach is very time-consuming. I am just wondering whether there is another way to do cross-validation on a Random Forest (i.e. NOT hyper-parameter optimization).
EDITED
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data.
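If that internal (out-of-bag) estimate is what you want, scikit-learn exposes it directly via the oob_score parameter. A minimal sketch, assuming X_train and y_train are your data:

from sklearn.ensemble import RandomForestClassifier

# oob_score=True evaluates each sample only with the trees that did NOT see it
# in their bootstrap sample, giving an internal generalization estimate.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.oob_score_)  # out-of-bag accuracy estimate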
Does GridSearchCV use cross-validation? Yes, GridSearchCV does in fact perform cross-validation. If I understand the idea correctly, you want to hide a portion of your data set from the model so that it can be tested on it. So you train your models on the training data and then test them on the testing data.
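For instance, a simple held-out split (assuming X and y are your full feature matrix and labels) could look like this:

from sklearn.model_selection import train_test_split

# keep 25% of the data completely out of the fitting process
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)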
Cross-Validation with any classifier in scikit-learn is really trivial:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # initialize with whatever parameters you want

# 10-fold cross-validation on the training data; report the mean score
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
If you wish to run a grid search, you can easily do it via the GridSearchCV class. In order to do so you will have to provide a param_grid, which according to the documentation is:
"Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings."
So maybe you could define your param_grid as follows:
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
Then you can use the GridSearchCV class as follows:
from sklearn.model_selection import GridSearchCV
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)
You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the grid scores using grid_clf.cv_results_.
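For example, after fitting you might inspect the results like this (a small illustrative snippet; X_test is a placeholder for your held-out data):

print(grid_clf.best_params_)            # best parameter combination found
print(grid_clf.best_score_)             # mean cross-validated score of that combination
best_rf = grid_clf.best_estimator_      # refit on the full training data (refit=True by default)
predictions = best_rf.predict(X_test)   # use the tuned model on new data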
Hope this helps!