
R: Kaggle Titanic Dataset Random Forest NAs introduced by coercion

I'm currently practicing R on Kaggle with the Titanic data set, using the random forest algorithm.

Below is the code:

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age_Bucket + Embarked
                + Age_Bucket + Fare_Bucket + F_Name + Title + FamilySize + FamilyID, 
                data=train, importance=TRUE, ntree=5000)

I am getting the following error

Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In data.matrix(x) : NAs introduced by coercion
4: In data.matrix(x) : NAs introduced by coercion

My data looks like this:

$ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
$ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
$ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1...
$ Age_Bucket : chr  "20-25" "30-40" "25-30" "30-40" ...
$ Fare_Bucket: chr  "<10" "30+" "<10" "30+" ...
$ Title      : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ F_Name     : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
$ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ FamilyID   : chr  "Small" "Small" "Alone" "Small" ...

If I just type the line below, I get no coercion issues, and as far as I can see it is the only place where coercion could create NA values:

as.factor(Survived)

Can anyone see the problem?

Thank you for your time

John Smith asked May 10 '15 13:05



1 Answer

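You need to convert your character columns into factors. As the warnings show, the predictors are coerced with data.matrix(): a factor is stored internally as integer codes, so it converts cleanly, whereas a character column cannot be converted here and becomes NA. See the following small demonstration: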

Data:

df <- data.frame(y = sample(0:1, 26, replace=TRUE), x1=runif(26), x2=letters, stringsAsFactors=FALSE)

df$y <- as.factor(df$y)

> str(df)
'data.frame':   26 obs. of  3 variables:
 $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
 $ x1: num  0.457 0.296 0.517 0.478 0.764 ...
 $ x2: chr  "a" "b" "c" "d" ...

Now if I run my randomForest function:

> randomForest(y ~ x1 + x2, data=df)
Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In data.matrix(x) : NAs introduced by coercion

I get the same error you did.
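
The coercion behind that warning can be reproduced in base R without randomForest. This is only a minimal sketch of the difference between the two types; the variable names (f, codes, chr, nums) are made up for illustration:

```r
# A factor stores integer codes plus a table of labels,
# so it converts to numbers cleanly (levels sort as "a", "b").
f <- factor(c("b", "a", "b"))
codes <- as.integer(f)
print(codes)   # 2 1 2

# A character vector has no numeric representation, so coercing it
# yields NA along with the "NAs introduced by coercion" warning
# (suppressed here to keep the output short).
chr <- c("b", "a", "b")
nums <- suppressWarnings(as.numeric(chr))
print(nums)    # NA NA NA
```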

Whereas if I convert the char column into factor:

df$x2 <- as.factor(df$x2)

> randomForest(y ~ x1 + x2, data=df)

Call:
 randomForest(formula = y ~ x1 + x2, data = df) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 61.54%
Confusion matrix:
  0  1 class.error
0 0 16           1
1 0 10           0

Now the model fits without the error.
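
For the data frame in the question, the same fix could be applied to every character column at once. The sketch below uses a small made-up stand-in for train (only the column names and types come from the str() output in the question), so treat it as an illustration rather than the exact code to run:

```r
# Hypothetical stand-in for the asker's `train` data frame; the values
# are invented, only the column names and types mirror the question.
train <- data.frame(
  Survived    = c(0L, 1L, 1L, 1L),
  Age_Bucket  = c("20-25", "30-40", "25-30", "30-40"),
  Fare_Bucket = c("<10", "30+", "<10", "30+"),
  FamilyID    = c("Small", "Small", "Alone", "Small"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each of them to a factor.
chr_cols <- vapply(train, is.character, logical(1))
train[chr_cols] <- lapply(train[chr_cols], factor)

# The chr columns should now be listed as Factor.
str(train)
```

One caveat, stated from memory rather than from the question: randomForest limits how many levels a factor predictor may have, so a high-cardinality column such as F_Name (surnames) may still need to be dropped or grouped into fewer categories after conversion.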

LyzandeR answered Sep 20 '22 16:09
