
How to tune parameters in Random Forest, using Scikit Learn?

class sklearn.ensemble.RandomForestClassifier(
    n_estimators=10,
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features='auto',
    max_leaf_nodes=None,
    bootstrap=True,
    oob_score=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None)

I'm using a random forest model with 9 samples and about 7000 attributes. The samples fall into 3 categories that my classifier recognizes.

I know this is far from ideal conditions, but I'm trying to figure out which attributes are the most important for the predictions. Which parameters would be the best to tweak to get reliable feature importances?

I tried different values of n_estimators and noticed that the number of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.

I've read through the documentation but if anyone has any experience in this, I would like to know which parameters are the best to tune and a brief explanation why.

O.rka asked Mar 19 '16 22:03





1 Answer

From my experience, there are three parameters worth exploring with the sklearn RandomForestClassifier, in order of importance:

  • n_estimators

  • max_features

  • criterion

n_estimators matters, but it is not really worth optimizing. Adding estimators generally improves or stabilizes the results, with diminishing returns and a higher computational cost. 500 or 1000 is usually sufficient.
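The effect described in the question (more trees, more nonzero entries in feature_importances_) can be reproduced with a minimal sketch. The data here is an assumption: random noise with the same shape as the asker's data (9 samples, 7000 attributes, 3 classes), so the importances themselves carry no meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the asker's data (an assumption): 9 samples,
# 7000 pure-noise attributes, 3 classes with 3 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 7000))
y = np.repeat([0, 1, 2], 3)

nonzero_counts = {}
for n_trees in (10, 100, 500):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X, y)
    # Count attributes with nonzero importance, i.e. attributes that
    # contributed to a split in at least one tree.
    nonzero_counts[n_trees] = int(np.count_nonzero(forest.feature_importances_))

for n_trees, count in nonzero_counts.items():
    print(f"{n_trees:>4} trees -> {count} attributes with nonzero importance")
```

With only 9 samples each tree can make just a handful of splits, so each tree touches only a few attributes. The count grows with the number of trees even on pure noise, which is why a nonzero importance alone probably should not be read as "significant".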

max_features is worth exploring for many different values. It may have a large impact on the behavior of the RF because it decides how many features each tree in the RF considers at each split.
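To see what different max_features settings mean in practice, the sketch below fits tiny forests and reads the resolved number of candidate attributes per split from the first fitted tree (its max_features_ attribute). The random data is an assumption that only mirrors the shape of the asker's data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in (an assumption): same shape as the data in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 7000))
y = np.repeat([0, 1, 2], 3)

features_per_split = {}
for setting in ("sqrt", "log2", 0.1, None):
    forest = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=0
    )
    forest.fit(X, y)
    # Each fitted tree stores the resolved number of candidate attributes
    # it draws at random for every split.
    features_per_split[setting] = forest.estimators_[0].max_features_

for setting, k in features_per_split.items():
    print(f"max_features={setting!r}: {k} of {X.shape[1]} attributes per split")
```

Smaller values make the trees more different from each other, while None lets every split look at all attributes, so the trees differ only through the bootstrap samples.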

criterion may have a small impact, but usually the default is fine. If you have the time, try it out.

Make sure to use sklearn's grid search utilities when trying out these parameters (preferably GridSearchCV, although your data set is too small for cross-validation).
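One way to run a grid search without cross-validation is sketched below. It enumerates the combinations with ParameterGrid and compares them by out-of-bag accuracy. Using the OOB score as the selection metric is my substitution rather than something the answer prescribes, and the data is again a random-noise stand-in (an assumption), so the scores it prints should sit near chance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in (an assumption): 9 samples, 7000 attributes, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 7000))
y = np.repeat([0, 1, 2], 3)

grid = ParameterGrid({
    "max_features": ["sqrt", "log2", 0.1],
    "criterion": ["gini", "entropy"],
})

results = []
for params in grid:
    forest = RandomForestClassifier(
        n_estimators=500,
        oob_score=True,   # score each sample using only trees that did not see it
        bootstrap=True,   # required for out-of-bag scoring
        random_state=0,
        **params,
    )
    forest.fit(X, y)
    results.append((forest.oob_score_, params))

# Pick the combination with the highest out-of-bag accuracy.
best_score, best_params = max(results, key=lambda item: item[0])
print(f"best OOB accuracy {best_score:.3f} with {best_params}")
```

With 9 samples, differences between combinations are likely to be noise, so treat the "best" setting as a rough hint at most.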

If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It's very, very likely that your RF is going to overfit with that little data, unless the samples are good, representative records.

Randy Olson answered Sep 28 '22 00:09
