I'm using the R package randomForest to do a regression on some biological data. My training data size is 38772 x 201.

I just wondered: what would be a good value for the number of trees, ntree, and the number of variables tried at each split, mtry? Is there an approximate formula to find such parameter values?

Each row in my input data is a 200-character string representing an amino acid sequence, and I want to build a regression model that uses such sequences to predict the distances between proteins.
The number of variables selected at each split is denoted by mtry in the randomForest function. Select the mtry value with the minimum out-of-bag (OOB) error; in this case, mtry = 4 is best, as it has the lowest OOB error, and 4 was also the default mtry. As a reminder: mtry is the number of variables randomly sampled as candidates at each split, and ntree is the number of trees to grow.
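A minimal sketch of that selection loop, assuming hypothetical objects X (a data frame of predictors) and y (a numeric response):

    library(randomForest)

    # Fit a small forest for each candidate mtry and record the final OOB error.
    # For regression, rf$mse is the running OOB mean squared error, one value per tree.
    candidates <- c(2, 4, 8, 16, 32)
    oob <- sapply(candidates, function(m) {
      rf <- randomForest(X, y, mtry = m, ntree = 501)
      tail(rf$mse, 1)          # OOB MSE after the last tree
    })
    best_mtry <- candidates[which.min(oob)]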
At each node, trees are split using a random subset of mtry variables. The default mtry is the square root of the total number of variables for classification, and one third of the total number of variables for regression.
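For the 200 predictors in the question, those defaults work out to:

    p <- 200
    floor(sqrt(p))  # 14 -- default mtry for classification
    floor(p / 3)    # 66 -- default mtry for regression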
The default for mtry is quite sensible, so there is not really a need to muck with it. There is a function, tuneRF, for optimizing this parameter. However, be aware that it may cause bias.
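A sketch of a tuneRF call, using the same hypothetical X and y as above; the step and improvement thresholds here are arbitrary choices:

    # Starts at the default mtry and multiplies/divides it by stepFactor,
    # keeping a move only if it improves the OOB error by at least `improve`.
    tuned <- tuneRF(X, y,
                    ntreeTry   = 501,
                    stepFactor = 1.5,
                    improve    = 0.01)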
There is no optimization for the number of bootstrap replicates. I often start with ntree = 501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error, but not so many that you over-correlate the ensemble, which leads to overfitting.
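In code, that workflow is just (X and y again standing in for your data):

    rf <- randomForest(X, y, ntree = 501)
    plot(rf)  # OOB error (MSE for regression) as trees are added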
Here is the caveat: variable interactions stabilize at a slower rate than error, so if you have a large number of independent variables you need more replicates. I would keep ntree an odd number so ties can be broken.

For the dimensions of your problem, I would start with ntree = 1501. I would also recommend looking into one of the published variable-selection approaches to reduce the number of your independent variables.
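As one simple screen (the package's built-in permutation importance, not one of the published selection methods), you could rank variables and keep the top ones; the cutoff of 50 below is an arbitrary example:

    rf <- randomForest(X, y, ntree = 1501, importance = TRUE)
    imp <- importance(rf, type = 1)   # %IncMSE, permutation importance
    keep <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:50]
    X_reduced <- X[, keep]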
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from its default of 500 quite a bit.
The "correct" value for ntree
generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
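One way to see that convergence directly (placeholder X and y again): fit a generous forest once and plot the per-tree OOB error stored on the fitted object.

    rf <- randomForest(X, y, ntree = 2000)
    # rf$mse holds one OOB MSE value per tree, so the curve shows where
    # adding more trees stops changing anything.
    plot(rf$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")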
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize, etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
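For reference, all of those are plain arguments to randomForest, so experimenting is cheap to set up; the particular values below are arbitrary examples (X and y hypothetical as before):

    rf <- randomForest(X, y,
                       ntree    = 1000,
                       mtry     = 8,                     # variables tried per split
                       sampsize = floor(0.7 * nrow(X)),  # rows drawn per tree
                       nodesize = 5)                     # minimum terminal node size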
The caret package has a very general function, train, that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time-consuming fairly quickly, so watch out for that.
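A minimal caret sketch under the same assumptions (hypothetical X and y, and an arbitrary mtry grid):

    library(caret)

    fit <- train(X, y,
                 method    = "rf",
                 tuneGrid  = data.frame(mtry = c(2, 4, 8, 16, 32)),
                 trControl = trainControl(method = "cv", number = 5))
    fit$bestTune  # the mtry value with the best cross-validated error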
Also, somehow I forgot that the randomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value for mtry.