
Train test split in `r`'s `caret` package

Tags:

r

r-caret

I'm getting familiar with R's caret package, but, coming from other programming languages, I find it thoroughly confusing.

What I want to do now is a fairly simple machine learning workflow, which is:

1. Take a data set, in my case the iris dataset
2. Split it into a training and a test set (an 80-20 split)
3. For every k from 1 to 20, train the k-nearest-neighbor classifier on the training set
4. Test it on the test set

I understand how to do the first part, since iris is already loaded. Then, the second part is done by calling


a <- createDataPartition(iris$Species, list=FALSE)
training <- iris[a,]
test <- iris[-a,]

Now, I also know that I can train the model by calling


library(caret)
knnFit <- train(Species~., data=training, method="knn")

However, this results in caret already performing some optimisation of the parameter k. Of course, I can limit which values of k the method should try, with something like


knnFit <- train(Species~., data=training, method="knn", tuneGrid=data.frame(k=1:20))

which works just fine, but it still doesn't do exactly what I want it to do. This code will now do, for each k:

1. Take a bootstrap sample from the training data.
2. Assess the performance of the k-nn method using the given sample.

What I want it to do:

1. For each k, train the model on the same training set which I constructed earlier.
2. Assess the performance on the same test set which I constructed earlier.

So I would need something like


knnFit <- train(Species~., training_data=training, test_data=test, method="knn", tuneGrid=data.frame(k=1:20))

but this of course does not work.

I understand I should do something with the trainControl parameter, but I see its possible methods are:


"boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", "none"

and none of these seems to do what I want.


asked Mar 01 '16 08:03

5xum

2 Answers

If I understand the question correctly, this can all be done within caret using LGOCV (leave-group-out cross-validation, i.e. repeated train/test splits). Set the training percentage to p = 0.8 and the number of train/test splits to number = 1 if you really want just one model fit per k, evaluated on a single test set. Setting number > 1 will instead assess model performance on number different train/test splits.


data(iris)
library(caret)
set.seed(123)
mod <- train(Species ~ ., data = iris, method = "knn",
             tuneGrid = expand.grid(k = 1:20),
             trControl = trainControl(method = "LGOCV", p = 0.8, number = 1,
                                      savePredictions = TRUE))

All predictions made by the different models on the test set are stored in mod$pred when savePredictions = TRUE. Note the rowIndex column: it lists the rows that were sampled into the test set. These are identical for every value of k, so the same training and test sets are used each time.


> head(mod$pred)
    pred    obs rowIndex k  Resample
1 setosa setosa        5 1 Resample1
2 setosa setosa        6 1 Resample1
3 setosa setosa       10 1 Resample1
4 setosa setosa       12 1 Resample1
5 setosa setosa       16 1 Resample1
6 setosa setosa       17 1 Resample1
> tail(mod$pred)
         pred       obs rowIndex  k  Resample
595 virginica virginica      130 20 Resample1
596 virginica virginica      131 20 Resample1
597 virginica virginica      135 20 Resample1
598 virginica virginica      137 20 Resample1
599 virginica virginica      145 20 Resample1
600 virginica virginica      148 20 Resample1
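
As a rough sketch of how the saved predictions could be turned into one test-set accuracy per k, the block below refits the model from above and averages the hits per value of k. The names hits and acc_by_k are made up for this illustration and are not part of caret.

```r
library(caret)

data(iris)
set.seed(123)
mod <- train(Species ~ ., data = iris, method = "knn",
             tuneGrid = expand.grid(k = 1:20),
             trControl = trainControl(method = "LGOCV", p = 0.8, number = 1,
                                      savePredictions = TRUE))

# Mark every held-out prediction as correct (1) or wrong (0) ...
hits <- data.frame(k = mod$pred$k,
                   correct = as.numeric(mod$pred$pred == mod$pred$obs))

# ... and average per value of k to get the held-out accuracy.
acc_by_k <- aggregate(correct ~ k, data = hits, FUN = mean)
names(acc_by_k)[2] <- "test_accuracy"
print(acc_by_k)

# caret keeps its own per-k summary of the resampling in mod$results.
print(mod$results[, c("k", "Accuracy")])
```

With a single split, the accuracies computed by hand should line up with the Accuracy column that caret reports.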

There's no need to construct train/test sets manually outside of caret unless some kind of nested validation procedure is desired. You can also plot the validation curve for the different values of k with plot(mod).
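
If the split has already been built by hand, as in the question, one possible way to reuse it is through the index and indexOut arguments of trainControl, which take lists of row numbers for the training and held-out parts of each resample. The following is only a sketch under that assumption; the names train_rows, test_rows, ctrl, mod_fixed and the label Split1 are made up for illustration.

```r
library(caret)

data(iris)
set.seed(123)

# Build the 80/20 split once, as row numbers of iris.
a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_rows <- as.vector(a)
test_rows <- setdiff(seq_len(nrow(iris)), train_rows)

# Hand that single split to caret instead of letting it resample.
ctrl <- trainControl(method = "LGOCV",
                     index = list(Split1 = train_rows),
                     indexOut = list(Split1 = test_rows),
                     savePredictions = TRUE)

mod_fixed <- train(Species ~ ., data = iris, method = "knn",
                   tuneGrid = data.frame(k = 1:20),
                   trControl = ctrl)

# One accuracy per k, each measured on the same 30 held-out rows.
print(mod_fixed$results[, c("k", "Accuracy")])
```

Keep in mind that train refits the final model with the selected k on all rows passed in data, so mod_fixed$finalModel has also seen the held-out rows; only the per-k figures in mod_fixed$results come from the fixed split.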

answered Sep 30 '22 02:09

thie1e

Please read through the caret website to see how everything works. Or read the book "Applied Predictive Modeling" written by Max Kuhn for more info on how caret works.

Roughly speaking, trainControl contains a diverse set of parameters for the train function, like cross-validation settings, metrics to apply (ROC / RMSE), sampling, preprocessing, etc.

In train you can set additional settings like grid searches. I extended your code example so it works. Make sure to check how createDataPartition works, because the default setting splits the data in half.


library(caret)

a <- createDataPartition(iris$Species, p = 0.8, list=FALSE)
training <- iris[a,]
test <- iris[-a,]

knnFit <- train(Species ~ ., 
                data = training, 
                method="knn",  
                tuneGrid=data.frame(k=1:20))

knn_pred <- predict(knnFit, newdata = test)
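
To see how the tuned model does on the held-out rows, the predictions can be compared with the true labels, for example with caret's confusionMatrix. This is a minimal sketch that repeats the code above; the seed is an assumption added only to make the split reproducible, and cm is just an illustrative name.

```r
library(caret)

set.seed(42)  # assumption: any seed works, it only fixes the random split
a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[a, ]
test <- iris[-a, ]

knnFit <- train(Species ~ .,
                data = training,
                method = "knn",
                tuneGrid = data.frame(k = 1:20))

# Predictions come from the final model, i.e. the k that train selected.
knn_pred <- predict(knnFit, newdata = test)
print(knnFit$bestTune)

# Cross-tabulate predicted against true species on the test set.
cm <- confusionMatrix(knn_pred, test$Species)
print(cm$table)
print(cm$overall["Accuracy"])
```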

EDIT based on comment:

What you want is not possible with one train object. train uses tuneGrid to find the best k and uses that value in the finalModel, which is then used for making predictions.

If you want to have an overview of all k's you might not want to use caret's train function but write a function for yourself. Maybe something like below. Note that knn3 is a knn-model from caret.


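k <- 20
knn_fit_list <- vector("list", k)
knn_pred_list <- vector("list", k)

for (i in seq_len(k)) {
  # Fit a k-nearest-neighbour model with i neighbours on the training set
  knn_fit_list[[i]] <- knn3(Species ~ .,
                            data = training,
                            k = i)
  # Predict class labels for the held-out test set
  knn_pred_list[[i]] <- predict(knn_fit_list[[i]], newdata = test, type = "class")
}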

knn_fit_list will contain the fitted model for each value of k, and knn_pred_list the corresponding predictions on the test set.
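
To get from those predictions to a single accuracy figure per k, something along these lines might do. It is a self-contained sketch: the seed and the names ks, test_accuracy and results are assumptions made for the example.

```r
library(caret)

set.seed(42)  # assumption: fixes the random split for reproducibility
a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[a, ]
test <- iris[-a, ]

ks <- 1:20

# For each k: fit on the training set, predict the test set,
# and record the share of correctly classified test rows.
test_accuracy <- sapply(ks, function(k) {
  fit <- knn3(Species ~ ., data = training, k = k)
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred == test$Species)
})

results <- data.frame(k = ks, accuracy = test_accuracy)
print(results)

# First k that reaches the highest test accuracy
print(results$k[which.max(results$accuracy)])
```

Every k is trained and scored on exactly the same two sets here, which is the workflow the question asks for.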

answered Sep 30 '22 02:09

phiver
