I have made a start at creating some training and test sets using 10-fold cross-validation for an artificial dataset:
rows <- 1000
# simulate one predictor and a logistic occurrence probability
X1 <- sort(runif(n = rows, min = -1, max = 1))
occ.prob <- 1 / (1 + exp(-(0.0 + 3.0 * X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
data <- data.frame(X1, true.presence)
# randomly assign each row to one of 10 folds
id <- sample(1:10, nrow(data), replace = TRUE)
ListX <- split(data, id)
fold1 <- data[id==1,]
fold2 <- data[id==2,]
fold3 <- data[id==3,]
fold4 <- data[id==4,]
fold5 <- data[id==5,]
fold6 <- data[id==6,]
fold7 <- data[id==7,]
fold8 <- data[id==8,]
fold9 <- data[id==9,]
fold10 <- data[id==10,]
trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
testset <- subset(data, id %in% c(1))
I am just wondering whether there are easier ways to achieve this, and how I could perform stratified cross-validation, which ensures that the class priors (true.presence) are roughly the same in all folds?
I found splitTools pretty useful; I hope the vignette https://cran.r-project.org/web/packages/splitTools/vignettes/splitTools.html can help anyone interested in this topic. For example:
> y <- rep(c(letters[1:4]), each = 5)
> y
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "d" "d" "d" "d" "d"
> create_folds(y)
$Fold1
[1] 1 2 3 5 6 7 8 10 12 13 14 15 17 18 19 20
$Fold2
[1] 1 2 4 5 6 8 9 10 11 12 13 14 16 17 19 20
$Fold3
[1] 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 20
$Fold4
[1] 1 2 3 4 7 8 9 10 11 13 14 15 16 18 19 20
$Fold5
[1] 1 3 4 5 6 7 8 9 11 12 14 15 16 17 18 19
> create_folds(y, m_rep = 3)
$Fold1.Rep1
[1] 1 2 4 5 6 7 8 10 11 12 13 15 16 17 19 20
$Fold2.Rep1
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 20
$Fold3.Rep1
[1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
$Fold4.Rep1
[1] 1 2 3 4 6 7 9 10 11 13 14 15 16 18 19 20
$Fold5.Rep1
[1] 1 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19
$Fold1.Rep2
[1] 1 2 3 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold2.Rep2
[1] 1 2 3 4 6 7 8 10 11 12 14 15 17 18 19 20
$Fold3.Rep2
[1] 2 3 4 5 6 7 8 9 12 13 14 15 16 17 19 20
$Fold4.Rep2
[1] 1 3 4 5 7 8 9 10 11 13 14 15 16 17 18 20
$Fold5.Rep2
[1] 1 2 4 5 6 7 9 10 11 12 13 15 16 18 19 20
$Fold1.Rep3
[1] 1 2 3 4 6 7 9 10 11 12 13 15 16 18 19 20
$Fold2.Rep3
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold3.Rep3
[1] 1 2 4 5 6 7 8 9 11 12 14 15 16 17 19 20
$Fold4.Rep3
[1] 1 2 3 5 7 8 9 10 12 13 14 15 17 18 19 20
$Fold5.Rep3
[1] 1 3 4 5 6 7 8 10 11 13 14 15 16 17 18 20
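Applied to the data from the question, a minimal sketch (by default, create_folds performs stratified splitting and returns the in-sample, i.e. training, indices of each fold; the held-out rows are the complement):
library(splitTools)
set.seed(1)
# stratified 10-fold split on the binary response; each list element
# holds the training row indices of one fold
folds <- create_folds(data$true.presence, k = 10)
trainingset <- data[folds[[1]], ]
testset <- data[-folds[[1]], ]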
The createFolds function of the caret package performs a stratified partitioning. Here is a paragraph from the help page:
... The random sampling is done within the levels of y (=outcomes) when y is a factor in an attempt to balance the class distributions within the splits.
Here is the answer to your problem:
library(caret)
# fold assignment (1 to 10) for each row, stratified on the class
folds <- createFolds(factor(data$true.presence), k = 10, list = FALSE)
and the proportions:
> library(plyr)
> data$fold <- folds
> ddply(data, 'fold', summarise, prop=mean(true.presence))
fold prop
1 1 0.5000000
2 2 0.5050505
3 3 0.5000000
4 4 0.5000000
5 5 0.5000000
6 6 0.5049505
7 7 0.5000000
8 8 0.5049505
9 9 0.5000000
10 10 0.5050505
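Since list = FALSE returns a plain integer vector of fold assignments, holding out a single fold is a simple subset. A minimal sketch (holding out fold 1, as in the question):
i <- 1
testset <- data[folds == i, ]
trainingset <- data[folds != i, ]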
@joran is right (regarding his assumption (b)). dismo::kfold() is what you are looking for.
So, using data from the initial question:
require(dismo)
# assign each row to one of 10 folds, stratified by the response
folds <- kfold(data, k = 10, by = data$true.presence)
This gives a vector of length nrow(data) containing the fold assignment of each row of data. Hence, data[folds == 1, ] returns the first fold, which can be held out for validation, while data[folds != 1, ] can be used for training.
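To run the complete cross-validation you can loop over the folds. A minimal sketch, where the logistic model mirrors the data-generating process from the question and cv.err is just an illustrative name:
cv.err <- numeric(10)
for (i in 1:10) {
  train <- data[folds != i, ]
  test <- data[folds == i, ]
  fit <- glm(true.presence ~ X1, data = train, family = binomial)
  # predicted presence probabilities for the held-out fold
  pred <- predict(fit, newdata = test, type = "response")
  # Brier score: mean squared error of the predicted probabilities
  cv.err[i] <- mean((pred - test$true.presence)^2)
}
mean(cv.err)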
edit 6/2018: I strongly support using the caret package as recommended by @gkcn. It is better integrated in the tidyverse workflow and more actively developed. Go with that!