
C5.0 decision tree - c50 code called exit with value 1

I am getting the following error:

c50 code called exit with value 1

I am doing this on the Titanic data available from Kaggle.

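# Import the training data.
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0,
# where read.csv() no longer converts strings to factors by default.
train <- read.csv("train.csv", sep = ",", stringsAsFactors = TRUE)

# Inspect the structure
str(train)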

Output:

'data.frame':   891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Then I tried a C5.0 decision tree:

# Trying with C5.0 decision tree
library(C50)

# C5.0 models require a factor outcome; otherwise C5.0() raises an error
train$Survived <- factor(train$Survived)

new_model <- C5.0(train[-2],train$Survived)

Running the lines above gives me this error:

c50 code called exit with value 1

I can't figure out what's going wrong. I used similar code on a different dataset and it worked fine. Any ideas on how I can debug this?

-Thanks

zephyr asked Apr 02 '14 06:04


3 Answers

Here is what finally worked. I got the idea after reading this post.

library(C50)

test$Survived <- NA

combinedData <- rbind(train,test)

combinedData$Survived <- factor(combinedData$Survived)

# Rename the empty-string level (the first level of each factor)
levels(combinedData$Cabin)[1] <- "missing"
levels(combinedData$Embarked)[1] <- "missing"

new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]

new_model <- C5.0(new_train[,-2],new_train$Survived)

new_model_predict <- predict(new_model,new_test)

submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)

The intuition is that combining the two data sets before converting and relabelling gives the train and test sets identical factor levels, so the model sees the same levels at training and prediction time.
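To illustrate the point about consistent levels, here is a minimal sketch on toy data. The names train_toy and test_toy are made-up stand-ins, not the Kaggle files; the sketch only shows that rbind() merges the levels of a factor column and that row subsets keep the merged set.

```r
# Toy stand-ins for the train and test sets (hypothetical data)
train_toy <- data.frame(Embarked = factor(c("S", "C", "S")))
test_toy  <- data.frame(Embarked = factor(c("Q", "S")))

# Before combining, the two factors have different level sets
levels(train_toy$Embarked)
levels(test_toy$Embarked)

# rbind() on data frames takes the union of the factor levels
combined <- rbind(train_toy, test_toy)

# Splitting by row keeps the full level set in both pieces
train_new <- combined[1:3, , drop = FALSE]
test_new  <- combined[4:5, , drop = FALSE]

same_levels <- identical(levels(train_new$Embarked), levels(test_new$Embarked))
same_levels
```

After the split, both pieces carry the level "Q" even though it never occurs in the training rows.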

zephyr answered Oct 16 '22 15:10


For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.

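Regarding your problem: first off, I think you meant to write the following (although for a data frame train[-2] drops the same column, so this is not the cause of the error)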

new_model <- C5.0(train[,-2],train$Survived)

Next, notice the structure of the Cabin and Embarked columns. Both factors have an empty string ("") as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that

levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"

your algorithm will now run without an error.
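If more columns are affected, the same fix can be applied to every factor at once. The following is only a sketch: rename_empty_levels is a hypothetical helper name, and the toy data frame stands in for the real training set.

```r
# Hypothetical helper: relabel the empty-string level of every factor column
rename_empty_levels <- function(df, label = "missing") {
  for (col in names(df)) {
    if (is.factor(df[[col]])) {
      lv <- levels(df[[col]])
      lv[lv == ""] <- label        # only the "" level is changed
      levels(df[[col]]) <- lv
    }
  }
  df
}

# Toy data with empty strings in two factor columns (made-up values)
toy <- data.frame(
  Cabin    = factor(c("", "C85", "")),
  Embarked = factor(c("S", "", "C")),
  Fare     = c(7.25, 71.28, 7.92)
)

fixed <- rename_empty_levels(toy)
levels(fixed$Cabin)
levels(fixed$Embarked)
```

Unlike indexing the first level directly, this does not assume that the empty string is level 1, and non-factor columns such as Fare are left untouched.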

Marco answered Nov 14 '22 20:11


Just in case: you can take a look at the error with

summary(new_model)

This error also occurs when a variable name contains special characters. For example, you will get it if a variable name contains the character "я" (from the Russian alphabet).
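One way to check for such names before fitting is sketched below. It assumes that any non-ASCII character counts as "special"; ascii_safe_names is a hypothetical helper, and the positional placeholder names are just one possible choice.

```r
# Hypothetical helper: replace column names containing non-ASCII characters
# with positional placeholders of the form "V<column index>"
ascii_safe_names <- function(df) {
  nm <- names(df)
  # A name is flagged if any of its code points lies outside the ASCII range
  bad <- vapply(
    nm,
    function(s) any(utf8ToInt(enc2utf8(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  nm[bad] <- paste0("V", which(bad))
  names(df) <- nm
  df
}

# Toy data: the second column is named with the Cyrillic letter U+044F
toy_names <- data.frame(age = c(22, 38), x = c(1, 2))
names(toy_names)[2] <- "\u044f"

clean <- ascii_safe_names(toy_names)
names(clean)
```

The data itself is unchanged; only the offending column names are replaced.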

Rustam Guliev answered Nov 14 '22 20:11