I am working on Random Forest classification.
I found that cforest in the "party" package usually performs better than "randomForest".
However, it seems that cforest overfits easily.
Here is a random data set with a binary factor response and 10 numerical variables, all generated from rnorm().
# Sorry for redundant preparation.
data <- data.frame(response=rnorm(100))
data$response <- factor(data$response < 0)
data <- cbind(data, matrix(rnorm(1000), ncol=10))
colnames(data)[-1] <- paste("V",1:10,sep="")
I run cforest with the unbiased parameter set (which seems to be the recommended one).
library(party)
cf <- cforest(response ~ ., data=data, controls=cforest_unbiased())
table(predict(cf), data$response)
# FALSE TRUE
# FALSE 45 7
# TRUE 6 42
Fairly good prediction performance on meaningless data.
On the other hand, randomForest behaves honestly.
library(randomForest)
rf <- randomForest(response ~ ., data=data)
table(predict(rf), data$response)
# FALSE TRUE
# FALSE 25 27
# TRUE 26 22
Where do these differences come from?
I am afraid that I am using cforest in the wrong way. I would appreciate your advice.
Let me add some extra observations on cforest:
Some wondered why the training data set was passed to predict(). I did not prepare a separate test data set because I assumed the prediction would be made on the OOB samples, as it is for randomForest; that turned out not to be the case for cforest.
c.f. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
You cannot learn anything about the true performance of a classifier by studying its performance on the training set. Moreover, since there is no true pattern to find, you can't really tell whether it is worse to overfit like cforest did, or to guess randomly like randomForest did. All you can tell is that the two algorithms followed different strategies, but if you'd test them on new unseen data both would probably fail.
The only way to estimate the performance of a classifier is to test it on external data, that has not been part of the training, in a situation you do know there is a pattern to find.
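For illustration, here is a minimal sketch of such a check (the make_data() helper is just a convenience written for this example, not part of either package): train both forests on one random data set and score them on a second, independently generated one. With purely random data both should land near 50% accuracy.
library(party)
library(randomForest)
# Hypothetical helper: regenerates the same kind of random data as in the question
make_data <- function(n = 100, p = 10) {
  d <- data.frame(response = factor(rnorm(n) < 0))
  d <- cbind(d, matrix(rnorm(n * p), ncol = p))
  colnames(d)[-1] <- paste("V", 1:p, sep = "")
  d
}
train <- make_data()
test  <- make_data()
cf <- cforest(response ~ ., data = train, controls = cforest_unbiased())
rf <- randomForest(response ~ ., data = train)
# Accuracy on genuinely unseen random data; expect roughly 0.5 for both
mean(predict(cf, newdata = test) == test$response)
mean(predict(rf, newdata = test) == test$response)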
Some comments:
The predict methods have different defaults for cforest and randomForest models, respectively. party:::predict.RandomForest gets you
function (object, OOB = FALSE, ...)
{
RandomForest@predict(object, OOB = OOB, ...)
}
so
table(predict(cf), data$response)
gets me
FALSE TRUE
FALSE 45 13
TRUE 7 35
whereas
table(predict(cf, OOB=TRUE), data$response)
gets me
FALSE TRUE
FALSE 31 24
TRUE 21 24
which is a respectably dismal result.
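One more detail worth adding (my understanding of the randomForest documentation, not stated in the answer above): for a randomForest object, predict() without newdata already returns the out-of-bag predictions, which is why the rf table in the question looked like random guessing from the start. So the like-for-like comparison is the OOB table for both models:
# randomForest: predict() with no newdata returns the OOB predictions
# (also stored in rf$predicted), so these two tables should match
table(predict(rf), data$response)
table(rf$predicted, data$response)
# the comparable cforest table is the OOB one
table(predict(cf, OOB=TRUE), data$response)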