C5.0 decision tree - c50 code called exit with value 1

Question

I am getting the following error:

c50 code called exit with value 1

I am working with the Titanic data set available from Kaggle:

# Importing datasets
train <- read.csv("train.csv", sep=",")

# Inspect the structure
str(train)

Output:

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Then I tried fitting a C5.0 decision tree:

# Trying with C5.0 decision tree
library(C50)

# C5.0 models require a factor outcome; otherwise they raise an error
train$Survived <- factor(train$Survived)

new_model <- C5.0(train[-2],train$Survived)

Running the lines above gives me this error:

c50 code called exit with value 1

I can't figure out what is going wrong. Similar code worked fine on a different data set. Any ideas on how I can debug this?

-Thanks

zephyr · Accepted Answer

Here is what finally worked. I got the idea after reading this post:

library(C50)

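# Assumes train and test are already loaded; test has no Survived column, so add one
test$Survived <- NA

# Stack both sets so that every factor ends up with the same levels
combinedData <- rbind(train, test)

combinedData$Survived <- factor(combinedData$Survived)

# Rename the empty-string factor levels
levels(combinedData$Cabin)[1] <- "missing"
levels(combinedData$Embarked)[1] <- "missing"

# Split back into the original train and test rows
new_train <- combinedData[1:891, ]
new_test <- combinedData[892:1309, ]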

new_model <- C5.0(new_train[,-2],new_train$Survived)

new_model_predict <- predict(new_model,new_test)

submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)

The intuition is that combining the two sets before splitting them again gives the train and test data consistent factor levels.
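To illustrate the point about consistent factor levels, here is a minimal sketch on made-up data. The data frames `train_toy` and `test_toy` are hypothetical stand-ins, not the Kaggle files, and re-levelling with a shared level set is shown as one possible alternative to rbind, not as the method used above.

```r
# Toy stand-ins for the train and test sets (values are made up)
train_toy <- data.frame(Embarked = factor(c("S", "C", "S")))
test_toy  <- data.frame(Embarked = factor(c("Q", "S")))

# Factors built separately can end up with different level sets
levels_match_before <- identical(levels(train_toy$Embarked),
                                 levels(test_toy$Embarked))

# Re-level both columns against the union of their levels
all_levels <- union(levels(train_toy$Embarked), levels(test_toy$Embarked))
train_toy$Embarked <- factor(train_toy$Embarked, levels = all_levels)
test_toy$Embarked  <- factor(test_toy$Embarked, levels = all_levels)

levels_match_after <- identical(levels(train_toy$Embarked),
                                levels(test_toy$Embarked))
```

The underlying values are unchanged; only the set of allowed levels is widened so that both data frames agree.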

Marco · Answer

For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.

Regarding your problem: first off, I think you meant to write the following (although for a data frame, train[-2] also drops the second column):

new_model <- C5.0(train[,-2],train$Survived)

Next, notice the structure of the Cabin and Embarked columns. Both factors have an empty string as a level name (check with levels(train$Embarked)). This is where C50 falls over. If you rename those levels with

levels(train$Cabin)[1] <- "missing"
levels(train$Embarked)[1] <- "missing"

the model will fit without an error.
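If more columns might be affected, the same renaming can be applied to every factor at once. The sketch below uses a hypothetical helper, `fix_empty_levels`, and made-up toy data rather than the real Titanic file; it assumes the columns are factors, as in the str() output in the question.

```r
# Hypothetical helper: rename the empty-string level ("") of every factor column
fix_empty_levels <- function(df, replacement = "missing") {
  for (col in names(df)) {
    if (is.factor(df[[col]])) {
      lv <- levels(df[[col]])
      lv[lv == ""] <- replacement
      levels(df[[col]]) <- lv
    }
  }
  df
}

# Toy data standing in for the Titanic columns (values are made up)
toy <- data.frame(
  Cabin    = factor(c("", "C85", "", "C123")),
  Embarked = factor(c("S", "C", "", "Q")),
  Age      = c(22, 38, 26, 35)
)

fixed <- fix_empty_levels(toy)
levels(fixed$Embarked)
```

Unlike indexing with [1], this does not rely on the empty string being the first level, and non-factor columns such as Age are left untouched.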

Rustam Guliev · Answer

Just in case: you can look at the details of the error with

summary(new_model)

This error also occurs when a variable name contains special characters. For example, you will get it if a variable name contains the character "я" (from the Russian alphabet).
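A quick way to spot such names before fitting is to check each one for characters outside the ASCII range. This is only a sketch: `has_non_ascii` is a hypothetical helper, the column names are made up, and replacing the offending names with "var" plus their position is just one possible renaming scheme.

```r
# Hypothetical helper: TRUE for each name containing a non-ASCII character
has_non_ascii <- function(x) {
  vapply(x, function(nm) any(utf8ToInt(nm) > 127L), logical(1),
         USE.NAMES = FALSE)
}

# Made-up column names; "\u044f" is the Cyrillic letter mentioned above
col_names <- c("Age", "Fare", "\u044f_col")

bad <- has_non_ascii(col_names)

# Replace the offending names with plain ASCII placeholders
safe_names <- col_names
safe_names[bad] <- paste0("var", which(bad))
```

On a real data frame, the same check could be run on names(train) and the result assigned back with names(train) <- safe_names.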

Tags:

r

machine-learning

decision-tree

kaggle
