 

Specific Cross Validation with Random Forest

Tags:

scikit-learn

I am using Random Forest with scikit-learn. The RF overfits the data and the prediction results are poor.

The overfitting does NOT depend on the RF parameters (number of trees, tree depth).

Overfitting happens across many different parameter settings (tested via grid search).

To remedy this, I tweak the initial data / down-sample some results in order to affect the fitting (manually pre-processing the noisy samples). Then I:

loop over randomly seeded RF fits,

get each RF's predictions on the data to be predicted, and

select the model that best fits the "predicted data" (not the calibration data).

This Monte Carlo search (sketched below) is very time-consuming. I am just wondering: is there another way to do cross-validation on a Random Forest (i.e., NOT hyper-parameter optimization)?
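Roughly, the loop looks like this (a minimal sketch; X_train, y_train, X_val and y_val are placeholders for my calibration and evaluation sets):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

best_score, best_model = -1.0, None
for seed in range(100):  # many random re-fits: this is the expensive part
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(X_train, y_train)                          # fit on the calibration data
    score = accuracy_score(y_val, clf.predict(X_val))  # score on the held-out data
    if score > best_score:
        best_score, best_model = score, clf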


Brook asked Jul 01 '16



People also ask

Can I use cross-validation with random forest?

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data.
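In scikit-learn, this internal estimate is exposed through the oob_score flag. A minimal sketch (the data here is synthetic, generated with make_classification, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True scores each sample using only the trees that did NOT see it
# in their bootstrap sample (the "out-of-bag" estimate)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)  # out-of-bag accuracy estimate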

Does GridSearchCV do cross-validation?

Yes: GridSearchCV does, in fact, perform cross-validation. The idea is to hide a portion of your data set from the model so that it can be used for testing: you train your models on the training data and then evaluate them on the held-out testing data.
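A minimal sketch of that train/test idea using scikit-learn's train_test_split (again with synthetic data, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# hold out 25% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set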


1 Answer

Cross-Validation with any classifier in scikit-learn is really trivial:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # initialize with whatever parameters you want

# 10-fold cross-validation; cross_val_score returns one score per fold
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))

If you wish to run a grid search, you can easily do it via the GridSearchCV class. In order to do so, you will have to provide a param_grid, which according to the documentation is:

Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

So maybe you could define your param_grid as follows:

param_grid = {
                 'n_estimators': [5, 10, 15, 20],
                 'max_depth': [2, 5, 7, 9]
             }

Then you can use the GridSearchCV class as follows:

from sklearn.model_selection import GridSearchCV

grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)

You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the cross-validation results using grid_clf.cv_results_.
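For example (continuing the snippet above; X_test is a placeholder for your own held-out data, and the exact keys in cv_results_ can vary across scikit-learn versions):

# best hyper-parameters found by the grid search
print(grid_clf.best_params_)

# best_estimator_ is refit on the full training data by default,
# so it can be used directly for prediction
y_pred = grid_clf.best_estimator_.predict(X_test)

# mean cross-validated score for every parameter combination tried
print(grid_clf.cv_results_['mean_test_score'])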

Hope this helps!

Abhinav Arora answered Sep 20 '22
