I have a data frame df and I am building a machine learning model (C5.0 decision tree) to predict the class of a column (loan_approved):
Structure (not real data):
id occupation income loan_approved
1 business 4214214 yes
2 business 32134 yes
3 business 43255 no
4 sailor 5642 yes
5 teacher 53335 no
6 teacher 6342 no
Process:
Function:
error_free_predict = function(x){
  output = tryCatch({
    predict(C50_model, newdata = test[x,], type = "class")
  }, error = function(e) {
    "no"
  })
  return(output)
}
Applied the predict function:
test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))
Problem:
id occupation income loan_approved predicted_class
1 business 4214214 yes no
2 business 32134 yes no
3 business 43255 no no
4 sailor 5642 yes no
5 teacher 53335 no no
6 teacher 6342 no no
Question:
I know this is because the test data frame had a new level that was not present in the train data, but shouldn't my function work for all cases except this one?
P.S. I did not use sapply because it was too slow.
There are two parts to this problem.
The first is that a random train/test split can leave a level out of the training data. So instead of dividing the data randomly between train and test, you can do stratified sampling. Code using data.table for a 70:30 split:
ind <- total_data[, sample(.I, round(0.3*.N), FALSE),by="occupation"]$V1
train <- total_data[-ind,]
test <- total_data[ind,]
This ensures every level is split in the same 70:30 proportion between the train and test datasets, so the test dataset will not contain a "new" categorical level, which can happen with a purely random split.
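To make the split concrete, here is a small self-contained sketch. The toy table is made up and only mirrors the column layout from the question; the seed and row counts are my own assumptions for illustration.

```r
library(data.table)

set.seed(42)  # fixed seed so the split is reproducible

# Toy data with the same columns as in the question (values are invented)
total_data <- data.table(
  id = 1:20,
  occupation = rep(c("business", "sailor", "teacher"), times = c(10, 4, 6)),
  income = round(runif(20, 5000, 50000)),
  loan_approved = sample(c("yes", "no"), 20, replace = TRUE)
)

# Within each occupation, draw about 30% of the row indices for the test set
ind <- total_data[, sample(.I, round(0.3 * .N), FALSE), by = "occupation"]$V1

train <- total_data[-ind, ]
test  <- total_data[ind, ]

# Every occupation that appears in test also appears in train
all(unique(test$occupation) %in% unique(train$occupation))
```

Because round(0.3 * .N) is always smaller than .N, each occupation keeps at least one row in train, including occupations that have only a single row.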
The second part of the problem arises when the model is in production and encounters an altogether new level that was in neither the training nor the test set. To handle this, you can keep a list of all levels of every categorical variable the model was trained on:
lvl_cat_var1 <- unique(cat_var1)
lvl_cat_var2 <- unique(cat_var2)
and so on. Then, before predicting, check for new levels and split the data:
new_lvl_data <- total_data[!(cat_var1 %in% lvl_cat_var1 & cat_var2 %in% lvl_cat_var2)]
pred_data <- total_data[(cat_var1 %in% lvl_cat_var1 & cat_var2 %in% lvl_cat_var2)]
Then assign the default prediction to the rows with new levels:
new_lvl_data$predicted_class <- "no"
and run the full prediction on pred_data.
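The same check can be wrapped into one function, sketched below in base R. The names has_known_levels, predict_with_default and toy_predict are hypothetical helpers of mine, not part of C50. toy_predict is only a stand-in that fails on unseen occupations; in practice you would pass something like function(d) predict(C50_model, newdata = d, type = "class").

```r
# Toy train/test data mirroring the question (values are invented)
train <- data.frame(
  occupation = c("business", "business", "teacher", "teacher"),
  income = c(4214214, 32134, 53335, 6342),
  loan_approved = c("yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  occupation = c("business", "sailor", "teacher"),
  income = c(43255, 5642, 6000),
  stringsAsFactors = FALSE
)

# Levels seen during training, one entry per categorical column
known_levels <- list(occupation = unique(train$occupation))

# TRUE for rows whose categorical values were all seen in training
has_known_levels <- function(data, known_levels) {
  ok <- rep(TRUE, nrow(data))
  for (col in names(known_levels)) {
    ok <- ok & as.character(data[[col]]) %in% known_levels[[col]]
  }
  ok
}

# Predict the known rows in one vectorized call and use a default class
# for the rows that contain an unseen level
predict_with_default <- function(data, known_levels, predict_fun, default = "no") {
  ok <- has_known_levels(data, known_levels)
  out <- rep(default, nrow(data))
  if (any(ok)) {
    out[ok] <- as.character(predict_fun(data[ok, , drop = FALSE]))
  }
  out
}

# Stand-in predictor (assumption, for illustration only): errors on unseen levels
toy_predict <- function(d) {
  stopifnot(all(d$occupation %in% c("business", "teacher")))
  ifelse(d$occupation == "business", "yes", "no")
}

test$predicted_class <- predict_with_default(test, known_levels, toy_predict)
```

This also shows why the function in the question returns "no" for every row: tryCatch wraps a single predict call over all rows, so one failing row makes the whole call return the single value "no", which mutate then recycles across the column. Masking the bad rows first keeps the prediction vectorized, so there is no need for a slow row-by-row sapply.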