Classification - Usage of factor levels

I am currently working on a predictive model for a churn problem.
Whenever I try to run the following model, I get this error:

At least one of the class levels is not a valid R variable name. This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1. Please use factor levels that can be used as valid R variable names.

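library(caret)

# Combine ROC/Sens/Spec (twoClassSummary) with Accuracy/Kappa (defaultSummary)
fivestats <- function(...) c( twoClassSummary(...), defaultSummary(...))

# 10-fold cross-validation, repeated once, with class probabilities enabled
fitControl.default    <- trainControl( 
    method  = "repeatedcv"
  , number  = 10
  , repeats = 1 
  , verboseIter = TRUE
  , summaryFunction  = fivestats
  , classProbs = TRUE
  , allowParallel = TRUE)
set.seed(1984)

# Candidate complexity parameters for rpart: 0, 0.001, ..., 0.1
rpartGrid             <-  expand.grid(cp = seq(from = 0, to = 0.1, by = 0.001))

# Tune the tree on ROC; churn is the response, all other columns are predictors
rparttree.fit.roc <- train( 
    churn ~ .
  , data      = training.dt  
  , method    = "rpart"
  , trControl = fitControl.default
  , tuneGrid  = rpartGrid
  , metric = 'ROC'
  , maximize = TRUE
)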

The attached picture shows my data. I have already converted some columns from chr to factor.

[Image: data overview]

I do not understand what the problem is. If I converted the entire data set to factors, a variable such as total_airtime_out would probably end up with around 9000 levels.

Thanks for any kind of help!

Asked by Simon, May 20 '17 at 10:05


2 Answers

I can't reproduce your error exactly, but my educated guess is that the error message tells you everything you need to know:

At least one of the class levels is not a valid R variable name. This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1. Please use factor levels that can be used as valid R variable names.

Looking at your response variable, its levels are "0" and "1", which aren't valid variable names in R (you can't do 0 <- "my value"). Presumably this problem will go away if you rename the levels of the response variable with something like

levels(training.dt$churn) <- c("first_class", "second_class")

as per this Q.
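
A minimal sketch of that renaming on toy data, using base R only. The data frame `toy`, the helper `is_valid`, and the labels "no" and "yes" are assumptions for illustration, since the original `training.dt` is not shown. Passing explicit `levels` and `labels` to `factor()` spells out which old value maps to which new label instead of relying on the current level order.

```r
# Toy stand-in for training.dt (hypothetical data, not the asker's)
toy <- data.frame(churn = factor(c(0, 1, 1, 0, 1)))

# A level is a valid R name if make.names() leaves it unchanged
is_valid <- function(lv) lv == make.names(lv)
valid_before <- is_valid(levels(toy$churn))

# Relabel with an explicit old -> new mapping: "0" -> "no", "1" -> "yes"
toy$churn <- factor(toy$churn, levels = c("0", "1"), labels = c("no", "yes"))
valid_after <- is_valid(levels(toy$churn))

print(valid_before)
print(valid_after)
print(table(toy$churn))
```

Only the outcome column needs this treatment; numeric predictors can stay numeric.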

Answered by einar, Oct 28 '22 at 13:10


How about wrapping the response in the base function make.names() in the formula:

 make.names(churn) ~ .,

to "make syntactically valid names out of character vectors" (as the make.names documentation puts it)?
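
One way to apply this idea, sketched on a hypothetical toy vector: call `make.names()` on the factor's levels in the data rather than inside the formula, so the outcome stays a factor. This is a data-level variant of the suggestion above, not the exact call from the answer, and whether the formula version behaves the same way inside `train()` is not verified here.

```r
# Hypothetical outcome vector standing in for training.dt$churn
churn <- factor(c("0", "1", "1", "0"))

# make.names() prepends "X" to names that start with a digit
new_levels <- make.names(levels(churn))

# Assign back to the levels so the column remains a factor
levels(churn) <- new_levels
print(levels(churn))
```

The resulting levels are the same X0 and X1 that the error message mentions, so more descriptive labels may be preferable when reading the model output.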

Answered by Dbercules, Oct 28 '22 at 12:10