I am currently working on a predictive model for a churn problem.
Whenever I try to run the following model, I get this error: At least one of the class levels is not a valid R variable name. This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1. Please use factor levels that can be used as valid R variable names.
fivestats <- function(...) c( twoClassSummary(...), defaultSummary(...))
fitControl.default <- trainControl(
method = "repeatedcv"
, number = 10
, repeats = 1
, verboseIter = TRUE
, summaryFunction = fivestats
, classProbs = TRUE
, allowParallel = TRUE)
set.seed(1984)
rpartGrid <- expand.grid(cp = seq(from = 0, to = 0.1, by = 0.001))
rparttree.fit.roc <- train(
churn ~ .
, data = training.dt
, method = "rpart"
, trControl = fitControl.default
, tuneGrid = rpartGrid
, metric = 'ROC'
, maximize = TRUE
)
In the attached picture you see my data, I already transformed some data from chr to factor variable.
I do not get what my problem is, if I would transform the entire data into factors, then for instance the variable total_airtime_out will probably have around 9000 factors.
Thanks for any kind of help!
Factor levels are all of the values that the factor can take (recall that a categorical variable has a set number of groups). In a designed experiment, the treatments represent each combination of factor levels. If there is only one factor with k levels, then there would be k treatments.
Factor variables are categorical variables that can be either numeric or string variables. There are a number of advantages to converting categorical variables to factor variables.
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. Historically, factors were much easier to work with than characters.
The number of levels of a factor or independent variable is equal to the number of variations of that factor that were used in the experiment. If an experiment compared the drug dosages 50 mg, 100 mg, and 150 mg, then the factor "drug dosage" would have three levels: 50 mg, 100 mg, and 150 mg.
It's not exactly possible for me to reproduce your error, but my educated guess is that the error message tells you everything you need to know:
At least one of the class levels is not a valid R variable name. This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1. Please use factor levels that can be used as valid R variable names.
Emphasis mine. Looking at your response variable, its levels are "0"
and "1"
, these aren't valid variable names in R (you can't do 0 <- "my value"
). Presumably this problem will go away if you rename the levels of the response variable with something like
levels(training.dt$churn) <- c("first_class", "second_class")
as per this Q.
How about this base function:
make.names(churn) ~ .,
to "make syntactically valid names out of character vectors"?
Source
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With