Perhaps this is too long-winded. Simple question about sklearn's random forest:
For a true/false classification problem, is there a way in sklearn's random forest to specify the sample size used to train each tree, along with the ratio of true to false observations?
More details are below:
In the R implementation of random forest, the randomForest package, there's a `sampsize` argument. It lets you balance the sample used to train each tree based on the outcome.
For example, if you're trying to predict whether an outcome is true or false and 90% of the outcomes in the training set are false, you can set `sampsize = c(500, 500)`. Each tree is then trained on a random sample (drawn with replacement) from the training set containing 500 true and 500 false observations. In these situations, I've found the models predict true outcomes much better when using a 50% cut-off, yielding much higher kappas.
It doesn't seem like there is an option for this in the sklearn implementation.
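Since sklearn's `RandomForestClassifier` has no direct equivalent of `sampsize`, one workaround is to build the ensemble by hand: draw a fixed number of observations per class (with replacement) for each tree, exactly as `sampsize = c(500, 500)` would. Below is a minimal sketch of that idea; the data, the tree count, and the per-class sample size are all made up for illustration.

```python
# Sketch: emulate R's sampsize = c(n, n) by drawing a class-balanced
# bootstrap sample for each tree. Illustration only, not sklearn's
# built-in behavior; all sizes here are arbitrary.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy imbalanced data: roughly 90% False, 10% True.
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=5000)) > 2.5

n_trees, n_per_class = 25, 200          # hypothetical settings
pos_idx, neg_idx = np.flatnonzero(y), np.flatnonzero(~y)

trees = []
for _ in range(n_trees):
    # Fixed count per class, sampled with replacement.
    idx = np.concatenate([
        rng.choice(pos_idx, n_per_class, replace=True),
        rng.choice(neg_idx, n_per_class, replace=True),
    ])
    # max_features="sqrt" mimics random forest's per-split feature sampling.
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

# Majority vote across trees at a 50% cut-off.
votes = np.mean([t.predict(X) for t in trees], axis=0)
pred = votes >= 0.5
```

The `imbalanced-learn` package also ships a `BalancedRandomForestClassifier` that does this kind of per-tree resampling for you, if adding a dependency is acceptable.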
In sklearn, the sub-sample size is controlled with the `max_samples` parameter if `bootstrap=True` (the default); otherwise the whole dataset is used to build each tree. Note that `max_samples` only controls how many rows each tree sees, not the class ratio within them.
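For example, here is `max_samples` in use (available in sklearn 0.22 and later); the dataset and the settings are arbitrary, chosen just to show the parameter:

```python
# max_samples fixes the size of each tree's bootstrap sample,
# but the class mix inside that sample is still left to chance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data: ~90% / ~10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,
    max_samples=500,   # each tree is fit on 500 rows drawn with replacement
    random_state=0,
).fit(X, y)
```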
In version 0.16-dev, you can now use `class_weight="auto"` to get something close to what you want. This still uses all samples, but it reweights them so that the classes become balanced.
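In later sklearn releases the `"auto"` value was renamed `"balanced"`, and there is also `"balanced_subsample"`, which recomputes the weights on each tree's bootstrap sample and so is closer in spirit to R's per-tree balancing. A minimal sketch (toy data, arbitrary settings):

```python
# class_weight="balanced_subsample" reweights classes per bootstrap
# sample rather than once globally; the data here is made up.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced_subsample",
    random_state=0,
).fit(X, y)
```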