
How to perform random forest/cross validation in R


I'm unable to find a way of performing cross validation on a regression random forest model that I'm trying to produce.

I have a dataset containing 1664 explanatory variables (different chemical properties) and one response variable (retention time). I'm trying to build a regression random forest model in order to predict the retention time of something given its chemical properties.

    ID    RT (seconds)  1_MW    2_AMW  3_Sv   4_Se
    4281  38            145.29  5.01   14.76  28.37
    4952  40            132.19  6.29   11     21.28
    4823  41            176.21  7.34   12.9   24.92
    3840  41            174.24  6.7    13.99  26.48
    3665  42            240.34  9.24   15.2   27.08
    3591  42            161.23  6.2    13.71  26.27
    3659  42            146.22  6.09   12.6   24.16

This is a sample of the table that I have. I basically want to plot RT against 1_MW, etc. (up to 1664 variables), so I can find which of these variables are important and which aren't.

I do:

    r = randomForest(RT..seconds. ~ ., data = cadets, importance = TRUE, do.trace = 100)
    varImpPlot(r)

This tells me which variables are important and which aren't, which is great. However, I want to be able to partition my dataset so that I can perform cross validation on it. I found an online tutorial that explained how to do it, but for a classification model rather than regression.

I understand you do:

    k = 10
    n = floor(nrow(cadets) / k)
    i = 1
    s1 = ((i - 1) * n + 1)
    s2 = (i * n)
    subset = s1:s2

to define how many folds you want, the size of each fold, and the start and end rows of the subset. However, I don't know what to do from here. I was told to loop through the folds, but I honestly have no idea how to do this. Nor do I know how to then plot the validation set and the test set on the same graph to show the level of accuracy/error.

If you could please help me with this I'd be ever so grateful, thanks!


asked Nov 04 '13 01:11

user2062207

2 Answers

From the source:

The out-of-bag (oob) error estimate

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run...

In particular, predict.randomForest returns the out-of-bag prediction if newdata is not given.
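As a rough illustration of that point, the sketch below fits a regression forest and reads off the out-of-bag error without any hold-out set. The `demo` data frame, its columns `X1` to `X5`, and the response `RT` are made-up stand-ins for the real `cadets` data, so only the pattern carries over, not the numbers.

```r
library(randomForest)

# Synthetic stand-in for the real data (assumption: `cadets` is not available here).
set.seed(42)
n <- 200
demo <- data.frame(matrix(rnorm(n * 5), ncol = 5))   # columns X1..X5
demo$RT <- 40 + 3 * demo$X1 - 2 * demo$X2 + rnorm(n, sd = 0.5)

rf <- randomForest(RT ~ ., data = demo, ntree = 300, importance = TRUE)

# With no newdata, predict() returns the out-of-bag predictions:
# each row is predicted only by trees that did not see it during training.
oob_pred <- predict(rf)

# OOB mean squared error and root mean squared error.
oob_mse  <- mean((demo$RT - oob_pred)^2)
oob_rmse <- sqrt(oob_mse)

# The fitted object also stores the running OOB MSE, one value per tree count;
# the last entry is the estimate that uses all trees.
final_mse <- rf$mse[rf$ntree]

cat("OOB RMSE:", round(oob_rmse, 3), "\n")
```

With the real data, the same calls should work after swapping `demo` for `cadets` and `RT` for `RT..seconds.`.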


answered Sep 19 '22 08:09

topchef

As topchef pointed out, cross-validation isn't necessary as a guard against over-fitting. This is a nice feature of the random forest algorithm.

It sounds like your goal is feature selection, and cross-validation is still useful for this purpose. Take a look at the rfcv() function within the randomForest package. The documentation specifies a data frame of predictors and a response vector as input, so I'll start by creating those from your data.

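    set.seed(42)
    # Predictors: every column except the response
    x <- cadets
    x$RT..seconds. <- NULL
    # Response: retention time
    y <- cadets$RT..seconds.

    rf.cv <- rfcv(x, y, cv.fold = 10)

    # Cross-validated error against the number of variables used
    with(rf.cv, plot(n.var, error.cv))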
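If you would rather write the fold loop yourself, as the question started to do, something along these lines may help. It is only a sketch on made-up data: `demo`, `X1` to `X5`, and `RT` are hypothetical stand-ins for `cadets` and its columns. It also assigns rows to folds at random rather than using contiguous blocks such as `s1:s2`, because the sample rows in the question look sorted by retention time, and contiguous folds would then hold out whole ranges of the response.

```r
library(randomForest)

# Synthetic stand-in for `cadets` (assumption: the real data is not available here).
set.seed(42)
n <- 200
demo <- data.frame(matrix(rnorm(n * 5), ncol = 5))   # predictors X1..X5
demo$RT <- 40 + 3 * demo$X1 - 2 * demo$X2 + rnorm(n, sd = 0.5)

k <- 10
# Randomly assign every row to one of k folds of (nearly) equal size.
fold <- sample(rep(seq_len(k), length.out = n))

cv_pred   <- numeric(n)   # held-out prediction for every row
fold_rmse <- numeric(k)   # test error for each fold

for (i in seq_len(k)) {
  test_rows <- which(fold == i)
  train <- demo[-test_rows, ]
  test  <- demo[test_rows, ]

  # Fit on the other k - 1 folds, predict the fold that was held out.
  fit <- randomForest(RT ~ ., data = train, ntree = 200)
  cv_pred[test_rows] <- predict(fit, newdata = test)
  fold_rmse[i] <- sqrt(mean((test$RT - cv_pred[test_rows])^2))
}

cv_rmse <- sqrt(mean((demo$RT - cv_pred)^2))
cat("Per-fold RMSE:", round(fold_rmse, 3), "\n")
cat("Overall CV RMSE:", round(cv_rmse, 3), "\n")

# Observed vs. cross-validated predictions; points near the dashed line are well predicted.
plot(demo$RT, cv_pred, xlab = "Observed RT", ylab = "Cross-validated prediction")
abline(0, 1, lty = 2)
```

Because every row is predicted exactly once by a model that never saw it, the single observed-versus-predicted plot covers all folds together. Comparing `cv_rmse` with the out-of-bag error of a forest fitted on all rows is a quick sanity check that the two estimates agree.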

answered Sep 21 '22 08:09

Lenwood
