 

Specific Cross Validation with Random Forest

Tags:

scikit-learn

I am using Random Forest with scikit-learn. The RF overfits the data and the prediction results are poor.

The overfitting does NOT depend on the RF parameters (number of trees, tree depth).

Overfitting happens across many different parameter settings (tested via grid search).

To remedy this, I tweak the initial data / down-sample some results in order to affect the fitting (manually pre-processing the noisy samples). Then I:

loop over randomly seeded RF fits,

get each RF's predictions on the data to be predicted, and

select the model that best fits the "predicted data" (not the calibration data).

This Monte Carlo search (sketched below) is very time-consuming. I am just wondering: is there another way to do cross-validation on a Random Forest (i.e., NOT hyper-parameter optimization)?
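Roughly, the loop looks like this (a minimal sketch; X_train, y_train, X_val and y_val are placeholders for my calibration and evaluation sets):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

best_score, best_model = -1.0, None
for seed in range(100):  # many random re-fits: this is the expensive part
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(X_train, y_train)                          # fit on the calibration data
    score = accuracy_score(y_val, clf.predict(X_val))  # score on the held-out data
    if score > best_score:
        best_score, best_model = score, clf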


Brook asked Jul 01 '16



People also ask

Can I use cross-validation with random forest?

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data.
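In scikit-learn, this internal estimate is exposed through the oob_score flag. A minimal sketch (the data here is synthetic, generated with make_classification, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True scores each sample using only the trees that did NOT see it
# in their bootstrap sample (the "out-of-bag" estimate)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)  # out-of-bag accuracy estimate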

Does GridSearchCV do cross-validation?

Yes: GridSearchCV does, in fact, perform cross-validation. The idea is to hide a portion of your data set from the model so that it can be used for testing: you train your models on the training data and then evaluate them on the held-out testing data.
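A minimal sketch of that train/test idea using scikit-learn's train_test_split (again with synthetic data, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# hold out 25% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set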


1 Answer

Cross-Validation with any classifier in scikit-learn is really trivial:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier()  # initialize with whatever parameters you want

# 10-fold cross-validation; cross_val_score returns one score per fold
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))

If you wish to run a grid search, you can easily do it via the GridSearchCV class. In order to do so, you will have to provide a param_grid, which according to the documentation is:

Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

So maybe you could define your param_grid as follows:

param_grid = {
                 'n_estimators': [5, 10, 15, 20],
                 'max_depth': [2, 5, 7, 9]
             }

Then you can use the GridSearchCV class as follows:

from sklearn.model_selection import GridSearchCV

grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)

You can then get the best model using grid_clf.best_estimator_ and the best parameters using grid_clf.best_params_. Similarly, you can get the cross-validation results using grid_clf.cv_results_.
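For example (continuing the snippet above; X_test is a placeholder for your own held-out data, and the exact keys in cv_results_ can vary across scikit-learn versions):

# best hyper-parameters found by the grid search
print(grid_clf.best_params_)

# best_estimator_ is refit on the full training data by default,
# so it can be used directly for prediction
y_pred = grid_clf.best_estimator_.predict(X_test)

# mean cross-validated score for every parameter combination tried
print(grid_clf.cv_results_['mean_test_score'])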

Hope this helps!

Abhinav Arora answered Sep 20 '22
