R: Cross validation on a dataset with factors

Q: How do you implement cross-validation in R?

K-fold cross-validation: split the dataset randomly into K subsets. Use K-1 subsets to train the model. Test the model against the one subset that was left out. Repeat these steps K times, i.e., until every subset has been used once for testing.
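
The steps above can be sketched in base R as follows. This is only an illustration: the built-in mtcars data, the model mpg ~ wt, K = 5 and the seed are arbitrary choices that do not come from the question.

```r
# Manual K-fold cross-validation in base R (illustrative sketch).
set.seed(1)                                   # arbitrary seed, for reproducibility only
K     <- 5
n     <- nrow(mtcars)

# Assign each row to one of K folds at random, keeping fold sizes balanced.
folds <- sample(rep(seq_len(K), length.out = n))

fold.mse <- sapply(seq_len(K), function(k) {
  train <- mtcars[folds != k, ]               # the K-1 subsets used for training
  test  <- mtcars[folds == k, ]               # the held-out subset
  fit   <- lm(mpg ~ wt, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

cv.mse <- mean(fold.mse)                      # average test error over the K folds
```

Each row is used for testing exactly once, and the final estimate is the mean of the per-fold errors.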

Q: How do you do a 10 fold cross-validation in R?

With the caret package, set the method parameter of trainControl() to "cv" and the number parameter to 10. This requests cross-validation with ten folds. Any number of folds can be used, but five or ten are the most common choices. The resulting control object is then passed to train() through its trControl argument.
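
A minimal sketch of that setup, assuming the caret package is installed. The mtcars data, the formula and the "lm" method are placeholders chosen for illustration.

```r
library(caret)

set.seed(1)                                          # arbitrary seed
ctrl <- trainControl(method = "cv", number = 10)     # 10-fold cross-validation
fit  <- train(mpg ~ wt + hp, data = mtcars,
              method = "lm", trControl = ctrl)

fit$results                                          # cross-validated performance summary
```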

Tags:

r

data-analysis

cross-validation

Often, I want to run cross-validation on a dataset that contains some factor variables, and after running for a while the cross-validation routine fails with the error: factor x has new levels Y.

For example, using package boot:

library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor x has new levels B

Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.
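
The failure can be reproduced deterministically, without relying on a random split, by building a training partition that lacks one level on purpose. This sketch reuses the toy data from above; the train/test split itself is constructed by hand for illustration.

```r
d <- data.frame(x = c('A', 'A', 'B', 'B', 'C', 'C'), y = c(1, 2, 3, 4, 5, 6))

train <- d[d$x != 'C', ]       # a training partition that happens to miss level C
test  <- d[d$x == 'C', ]       # a test partition that contains only level C

m   <- glm(y ~ x, data = train)
res <- try(predict(m, newdata = test), silent = TRUE)

inherits(res, "try-error")     # TRUE: predict() rejects the unseen level
```

The model only knows the levels it was fitted on, so any test row carrying another level cannot be mapped to a coefficient.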

The function createDataPartition from the package caret does stratified sampling on the outcome variable and correctly warns:

Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
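
For reference, a sketch of stratifying on a factor with createDataPartition, assuming caret is installed. The iris data and p = 0.8 are illustrative; the classes here are large enough that the warning above does not apply.

```r
library(caret)

set.seed(1)                                    # arbitrary seed
# Passing a factor as `y` makes createDataPartition sample within each level.
train.idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

table(train$Species)                           # every species appears in the training data
table(test$Species)                            # and in the test data
```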

There are two solutions which spring to mind:

1. Create a subset of the data by selecting one random sample of each factor level, starting from the rarest class (by frequency) and then greedily satisfying the next rarest class, and so on. Then use createDataPartition on the rest of the dataset and merge the results to create a new training dataset which contains all levels.
2. Use createDataPartition and do rejection sampling.
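
A simplified sketch of option 1 in base R. It is not the exact procedure described above: it seeds the training set with one random row per level without the rarest-first ordering, and it fills the remainder by plain random sampling instead of createDataPartition. The function name seeded.split and its arguments are hypothetical.

```r
# Hypothetical helper: guarantee that the training set contains every level.
seeded.split <- function(data, factor.cols, p = 0.5) {
  n <- nrow(data)

  # Pick one random row for every level of every factor column.
  seed.idx <- unique(unlist(lapply(factor.cols, function(col) {
    groups <- split(seq_len(n), data[[col]], drop = TRUE)
    vapply(groups, function(i) i[sample.int(length(i), 1)], integer(1))
  })))

  # Fill the training set up to the target size from the remaining rows.
  rest      <- setdiff(seq_len(n), seed.idx)
  n.extra   <- min(length(rest), max(0, round(p * n) - length(seed.idx)))
  extra.idx <- head(rest[sample.int(length(rest))], n.extra)

  train.idx <- sort(c(seed.idx, extra.idx))
  list(train.idx = train.idx, test.idx = setdiff(seq_len(n), train.idx))
}

d   <- data.frame(x = c('A', 'A', 'B', 'B', 'C', 'C'), y = c(1, 2, 3, 4, 5, 6))
idx <- seeded.split(d, factor.cols = 'x')
unique(d$x[idx$train.idx])     # contains A, B and C in some order
```

If the seed rows alone exceed the target size, the training set simply ends up larger than p * n.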

So far, option 2 has worked for me because of the data sizes, but I cannot help thinking that there must be a better solution than a hand-rolled one.

Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.

Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
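
One way to get both properties for a single factor is to assign folds within each level. The sketch below is an assumption-laden illustration, not an existing package function: stratified.folds is a hypothetical name, it handles only one factor, and it stops immediately if any level has fewer than two rows, since such a level can never appear in both the training and the test data.

```r
# Hypothetical helper: stratified fold assignment for one factor, failing early.
stratified.folds <- function(f, K) {
  f      <- droplevels(as.factor(f))
  counts <- table(f)

  # Fail early: a level seen only once cannot be in both training and test data.
  if (any(counts < 2)) {
    stop("levels with fewer than 2 rows: ",
         paste(names(counts)[counts < 2], collapse = ", "))
  }

  fold <- integer(length(f))
  for (idx in split(seq_along(f), f)) {
    idx    <- idx[sample.int(length(idx))]     # shuffle the rows within the level
    offset <- sample.int(K, 1) - 1             # random starting fold
    # Deal the rows round-robin, so each level lands in at least two folds.
    fold[idx] <- (seq_along(idx) - 1 + offset) %% K + 1
  }
  fold
}

x    <- c('A', 'A', 'B', 'B', 'C', 'C')
fold <- stratified.folds(x, K = 2)
split(x, fold)                                 # each fold holds one A, one B and one C
```

Because every level is spread over at least two folds, removing any one fold still leaves every level in the training data. Note that cv.glm draws its own folds, so a fold vector like this would have to be used in a manual cross-validation loop.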

Please leave a comment if I should ask this question on stats.stackexchange.com instead.

Update:

This is what my hand-rolled solution (2) looks like:

get.cv.idx <- function(train.data, folds, factor.cols = NULL) {

    # Default to every factor column in the data frame.
    if (is.null(factor.cols)) {
        factor.cols     <- colnames(train.data)[sapply(train.data, is.factor)]
    }

    n                   <- nrow(train.data)
    test.n              <- floor(n / folds)

    cond.met            <- FALSE
    n.tries             <- 0

    # Rejection sampling: redraw the split until every level that occurs in
    # the test set also occurs in the training set.
    while (!cond.met) {
        n.tries         <- n.tries + 1
        test.idx        <- sample(n, test.n)
        train.idx       <- setdiff(seq_len(n), test.idx)

        cond.met        <- TRUE

        for (factor.col in factor.cols) {
            train.levels <- unique(train.data[ train.idx, factor.col ])
            test.levels  <- unique(train.data[ test.idx , factor.col ])
            # Comparing counts is not enough; the test levels must be a subset.
            if (!all(test.levels %in% train.levels)) {
                cat('Factor column: ', factor.col, ' violated constraint, retrying.\n')
                cond.met <- FALSE
            }
        }
    }

    cat('Done in ', n.tries, ' trie(s).\n')

    list( train.idx = train.idx
        , test.idx  = test.idx
        )
}


asked Nov 13 '13 06:11

musically_ut

1 Answer

Everyone agrees that there surely is an optimal solution. Personally, though, I would just retry the cv.glm call in a while loop until it works.

m.cv <- try(cv.glm(d, m, K=2)) # First try
class(m.cv) # Sometimes error, sometimes list
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K=2))
}
class(m.cv) # Always list

I've tried it with n = 100,000 (a data.frame of 200,002 rows) and it only takes a few seconds.

library(boot)
n <- 100000
d <- data.frame(x=c(rep('A', n), rep('B', n), 'C', 'C'), y=1:(n*2+2))
m <- glm(y ~ x, data=d)

m.cv <- try(cv.glm(d, m, K=2))
class(m.cv) # Sometimes error, sometimes list
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K=2))
}
class(m.cv) # Always list
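
The retry loop above runs forever if no valid split exists. A bounded variant could look like the sketch below; cv.glm.retry and max.tries are hypothetical names, and the limit of 50 attempts is an arbitrary choice.

```r
library(boot)

# Hypothetical wrapper: retry cv.glm a limited number of times, then give up.
cv.glm.retry <- function(data, model, K, max.tries = 50) {
  for (i in seq_len(max.tries)) {
    # Any error from this attempt (e.g. unseen factor levels) becomes NULL.
    res <- tryCatch(cv.glm(data, model, K = K), error = function(e) NULL)
    if (!is.null(res)) {
      return(res)
    }
  }
  stop("cv.glm failed in all ", max.tries, " attempts")
}

d    <- data.frame(x = c('A', 'A', 'B', 'B', 'C', 'C'), y = c(1, 2, 3, 4, 5, 6))
m    <- glm(y ~ x, data = d)
m.cv <- cv.glm.retry(d, m, K = 2)
m.cv$delta                       # cross-validation estimates of prediction error
```

Note that tryCatch swallows every error, not only the "new levels" one, so a genuinely broken model would also be retried until the limit is reached.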


answered Oct 03 '22 14:10

Pierre Lapointe
