class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
I'm using a random forest model with 9 samples and about 7000 attributes. The samples fall into 3 categories that my classifier recognizes.
I know these are far from ideal conditions, but I'm trying to figure out which attributes are the most important for the model's predictions. Which parameters would be best to tweak to optimize feature importance?
I tried different values of n_estimators and noticed that the number of "significant features" (i.e., nonzero values in the feature_importances_ array) increased dramatically.
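Roughly what I ran, with random placeholder data standing in for my real 9 × 7000 matrix:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # placeholder data shaped like mine: 9 samples, 7000 features, 3 classes
    rng = np.random.RandomState(0)
    X = rng.rand(9, 7000)
    y = np.repeat([0, 1, 2], 3)

    for n in [10, 100, 500, 1000]:
        rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
        # more trees touch more features, so more nonzero importances
        print(n, np.count_nonzero(rf.feature_importances_))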
I've read through the documentation, but if anyone has experience with this, I would like to know which parameters are best to tune and a brief explanation of why.
From my experience, there are three hyperparameters worth exploring with the sklearn RandomForestClassifier, in order of importance:
n_estimators
max_features
criterion
n_estimators is not really worth optimizing. The more estimators you give it, the better it will do; 500 or 1000 is usually sufficient.
max_features is worth exploring over many different values. It can have a large impact on the behavior of the RF, because it controls how many features each tree in the RF considers at each split (see the sketch after this list).
criterion may have a small impact, but usually the default is fine. If you have the time, try it out.
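For example, a quick manual sweep (X and y are placeholders shaped like your data; max_features accepts 'sqrt', 'log2', a float fraction, or None for all features) shows how this setting changes how many features pick up nonzero importance:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # placeholder data shaped like the asker's: 9 samples, 7000 features, 3 classes
    rng = np.random.RandomState(0)
    X, y = rng.rand(9, 7000), np.repeat([0, 1, 2], 3)

    for mf in ['sqrt', 'log2', 0.1, 0.5, None]:
        rf = RandomForestClassifier(n_estimators=500, max_features=mf,
                                    random_state=0).fit(X, y)
        # fewer candidate features per split spreads importance across more features
        print(mf, np.count_nonzero(rf.feature_importances_))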
Make sure to use sklearn's grid search (ideally GridSearchCV, although with a data set this small the cross-validation scores will be very noisy) when trying out these parameters.
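A minimal sketch of that grid search, reusing the placeholder X and y from the previous snippet (in current sklearn, GridSearchCV lives in sklearn.model_selection):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'max_features': ['sqrt', 'log2', 0.1, 0.5],
        'criterion': ['gini', 'entropy'],
    }
    # cv=3 keeps one sample of each class in every stratified fold;
    # anything larger would fail with only 3 samples per class
    search = GridSearchCV(RandomForestClassifier(n_estimators=500, random_state=0),
                          param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)
    print(search.best_estimator_.feature_importances_)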
If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It's very, very likely that your RF will overfit with so little data, unless the samples are good, representative records.