I am using the RandomForestClassifier from Python's sklearn package to build a binary classification model. Below are the cross-validation results:
Fold 1: Train: 164, Test: 40, Train Accuracy: 0.914634146341, Test Accuracy: 0.55
Fold 2: Train: 163, Test: 41, Train Accuracy: 0.871165644172, Test Accuracy: 0.707317073171
Fold 3: Train: 163, Test: 41, Train Accuracy: 0.889570552147, Test Accuracy: 0.585365853659
Fold 4: Train: 163, Test: 41, Train Accuracy: 0.871165644172, Test Accuracy: 0.756097560976
Fold 5: Train: 163, Test: 41, Train Accuracy: 0.883435582822, Test Accuracy: 0.512195121951
I am using "Price" feature to predict "quality" which is a ordinal value. In each cross validation, there are 163 training examples and 41 test examples.
The model is clearly overfitting here. Are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.
Thanks in advance!
To avoid over-fitting in a random forest, the main thing you need to do is optimize the tuning parameter that governs how many features are randomly chosen to grow each tree from the bootstrapped data. In sklearn, this parameter is max_features.
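A minimal sketch of tuning that parameter via cross-validation (the synthetic data and the candidate values are illustrative assumptions, not taken from the question's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data with the same overall size as the question's dataset.
X, y = make_classification(n_samples=204, n_features=10, random_state=0)

# Compare cross-validated accuracy for a few max_features settings.
for max_features in [1, 2, 3, "sqrt", None]:
    clf = RandomForestClassifier(n_estimators=200, max_features=max_features, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(max_features, scores.mean())
```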
Another common way to address overfitting is weight regularization: adding a cost to the model's loss function for large weights (parameter values). As a result, you get a simpler model that is forced to learn only the relevant patterns in the training data. Note that this applies to models trained by minimizing a loss over weights (e.g. neural networks or logistic regression); a random forest is not fit that way, so it is constrained through its tree-growing parameters instead.
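To make the idea concrete in sklearn terms, here is a hedged sketch using LogisticRegression, whose C parameter is the inverse of the L2 regularization strength; this is an analogous illustration of weight regularization, not the asker's random forest setup, and the data is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=204, n_features=10, random_state=0)  # stand-in data

# Smaller C means stronger regularization (larger penalty on big weights).
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    print(C, cross_val_score(clf, X, y, cv=5).mean())
```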
I would agree with @Falcon w.r.t. the dataset size. The main problem is likely the small size of the dataset. If possible, the best thing you can do is get more data: generally, the more data you have, the less likely the model is to overfit, because random patterns that appear predictive get drowned out as the dataset grows.
That said, I would look at the following params (all of them constrain how far each tree can fit the training data; see the sketch after this list):
- n_estimators: more trees means more averaging, which generally reduces variance.
- max_features: try reducing this; fewer candidate features per split decorrelates the trees.
- max_depth: capping tree depth directly limits how closely each tree can fit the training set.
- min_samples_split and min_samples_leaf: raising these above their defaults stops trees from splitting down to tiny, noisy groups of examples.
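A minimal sketch of a more constrained forest using these parameters (the specific values are illustrative assumptions, not tuned for this dataset):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,      # more trees: averages out noise across the ensemble
    max_features="sqrt",   # fewer candidate features per split
    max_depth=5,           # cap tree depth
    min_samples_split=10,  # require 10 samples in a node before splitting it
    min_samples_leaf=5,    # every leaf must cover at least 5 samples
    random_state=0,
)
```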
Note: be scientific when doing this work. Use three datasets: a training set, a separate 'development' set to tweak your parameters, and a test set to evaluate the final model with the optimal parameters. Change only one parameter at a time and evaluate the result. Alternatively, use sklearn's GridSearchCV to search across these parameters all at once.
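A hedged sketch of that grid search, keeping a held-out test set for the final evaluation (the grid values and stand-in data are illustrative; adjust them to your data and compute budget):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=204, n_features=10, random_state=0)  # stand-in data
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "max_features": [1, 2, "sqrt"],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}
# Cross-validate the grid on the development portion only.
search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      param_grid, cv=5)
search.fit(X_dev, y_dev)
print(search.best_params_, search.best_score_)

# Evaluate the chosen model once on the untouched test set.
print("held-out test accuracy:", search.score(X_test, y_test))
```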