
R randomForest for classification

I am trying to do classification with randomForest, but I keep getting an error message for which there seems to be no apparent solution (randomForest has worked well for me for regression in the past). I have pasted my code below. 'success' is a factor, and all of the predictor variables are numeric. Any suggestions on how to run this classification properly?

> rf_model <- randomForest(success ~ ., data = data.train, xtest = data.test[, 2:9], ytest = data.test[, 1], importance = TRUE, proximity = TRUE)

Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)

Also, here is a sample of the dataset:

head(data)

success duration  goal reward_count updates_count comments_count backers_count min_reward_level max_reward_level
True    20.00000  1500           10            14              2            68                1             1000
True    30.00000  3000           10             4              3            48                5             1000
True    24.40323 14000           23             6             10           540                5             1250
True    31.95833 30000            9            17              7           173                1            10000
True    28.13211  4000           10            23             97          2936               10              550
True    30.00000  6000           16            16            130          2043               25              500
user1799242 asked Jan 03 '13 16:01




1 Answer

Apart from the obvious causes, such as NA values in the data, this error is almost always caused by character columns in the data set. To see why, consider what a random forest does: it partitions the data set feature by feature. If one of the features is a character vector, there is nothing to partition on. You need categories to partition the data, for example how many 'male' vs. 'female'.

For numeric features such as age or price, you can create categories by bucketing: greater than a certain age, less than a certain price, and so on. You cannot do that with pure character features, so they need to be factors in your data set.
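As a rough sketch of that fix, the snippet below checks the column classes, converts any character columns to factors, and counts non-finite numeric values before fitting. The toy data frame is an assumption made for illustration: the column names follow the question, but the rows (including the "False" cases) are invented, and the real data may need different handling.

```r
# Toy stand-in for the question's data; these rows are illustrative, not real.
data.train <- data.frame(
  success       = c("True", "False", "True", "False", "True", "False"),
  duration      = c(20, 30, 24.4, 31.9, 28.1, 30),
  goal          = c(1500, 3000, 14000, 30000, 4000, 6000),
  backers_count = c(68, 48, 540, 173, 2936, 2043),
  stringsAsFactors = FALSE  # keep 'success' as character to mimic the problem
)

# Inspect column classes: any "character" column needs converting.
print(sapply(data.train, class))

# Convert every character column to a factor.
is_chr <- sapply(data.train, is.character)
data.train[is_chr] <- lapply(data.train[is_chr], factor)

# Count NA/NaN/Inf values in the numeric columns, the other common cause.
is_num <- sapply(data.train, is.numeric)
n_bad <- sum(!is.finite(as.matrix(data.train[is_num])))

# Fit only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_model <- randomForest::randomForest(success ~ ., data = data.train,
                                         importance = TRUE, proximity = TRUE)
  print(rf_model)
}
```

If xtest and ytest are supplied as in the question, the same conversion would presumably have to be applied to data.test, with factor levels matching those in the training set.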

Kingz answered Sep 20 '22 06:09
