Random Forest by R package party overfits on random data

I am working on Random Forest classification.

I found that cforest in the "party" package usually performs better than "randomForest".
However, cforest seems to overfit easily.

A toy example

Here is a random data set consisting of a binary factor response and 10 numeric variables generated with rnorm().

# Sorry for the redundant preparation.
library(party)         # for cforest()
library(randomForest)  # for randomForest(), used below
data <- data.frame(response=rnorm(100))     # pure noise
data$response <- factor(data$response < 0)  # binary factor response
data <- cbind(data, matrix(rnorm(1000), ncol=10))  # 10 noise predictors
colnames(data)[-1] <- paste("V",1:10,sep="")

Fit cforest with the unbiased parameter set (which appears to be the recommended one).

cf <- cforest(response ~ ., data=data, controls=cforest_unbiased())
table(predict(cf), data$response)
#       FALSE TRUE
# FALSE    45    7
# TRUE      6   42

Fairly good prediction performance on meaningless data.

On the other hand, randomForest behaves honestly.

rf <- randomForest(response ~., data=data)
table(predict(rf),data$response)
#       FALSE TRUE
# FALSE    25   27
# TRUE     26   22

Where do these differences come from?
I am afraid I am using cforest in the wrong way.

Let me add some extra observations on cforest:

  1. The number of variables did not affect the result much.
  2. Variable importance values (computed by varimp(cf)) were rather low, compared to those obtained with realistic explanatory variables.
  3. The AUC of the ROC curve was nearly 1 (a sketch of this check follows the list).
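For reference, here is a minimal sketch of how that AUC check could be done on the training predictions; it assumes the pROC package (not named above) for the ROC computation.

library(pROC)  # assumed package for the AUC computation

# treeresponse() returns per-observation class probability vectors for a
# cforest; element 2 is P(response == TRUE), the second factor level.
prob_true <- sapply(treeresponse(cf), "[", 2)
auc(roc(data$response, prob_true))  # close to 1 on the training data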

I would appreciate your advice.

Additional note

Some wondered why the training data set was passed to predict().
I did not prepare a separate test data set because I assumed the predictions were made on out-of-bag (OOB) samples, which turned out not to be the case for cforest.
cf. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
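For randomForest, predict() without newdata does return the OOB predictions; a quick sanity check is to compare it against the predicted component of the fitted object.

# OOB predictions are stored in the fitted randomForest object, so this
# comparison should come out TRUE:
identical(predict(rf), rf$predicted)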

asked Oct 23 '13 by dytori


2 Answers

You cannot learn anything about the true performance of a classifier by studying its performance on the training set. Moreover, since there is no true pattern to find, you can't really tell whether it is worse to overfit, as cforest did, or to guess randomly, as randomForest did. All you can tell is that the two algorithms followed different strategies; if you tested them on new, unseen data, both would probably fail.

The only way to estimate the performance of a classifier is to test it on external data that was not part of the training, in a situation where you know there is a pattern to find.
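As a minimal illustration of that point, a hypothetical hold-out split of the same random data should leave both models near 50% accuracy:

set.seed(1)  # hypothetical split for illustration
idx   <- sample(nrow(data), 70)
train <- data[idx, ]
test  <- data[-idx, ]

cf2 <- cforest(response ~ ., data = train, controls = cforest_unbiased())
mean(predict(cf2, newdata = test) == test$response)  # expected around 0.5

rf2 <- randomForest(response ~ ., data = train)
mean(predict(rf2, newdata = test) == test$response)  # expected around 0.5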

Some comments:

  1. The number of variables shouldn't matter if none of them contain any useful information.
  2. It is nice to see that the variable importance is lower for meaningless data than for meaningful data. This could serve as a sanity check for the method, but probably not much more.
  3. AUC (or any other performance measure) doesn't matter on the training set, since it is trivial to obtain perfect classification results there.
answered by Backlin


The predict methods have different defaults for cforest and randomForest models. party:::predict.RandomForest gets you

function (object, OOB = FALSE, ...) 
{
    RandomForest@predict(object, OOB = OOB, ...)
}

so

table(predict(cf), data$response)

gets me

        FALSE TRUE
  FALSE    45   13
  TRUE      7   35

whereas

table(predict(cf, OOB=TRUE), data$response)

gets me

        FALSE TRUE
  FALSE    31   24
  TRUE     21   24

which is a respectably dismal result.
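For comparison, randomForest has the opposite default: predict(rf) with no newdata is already OOB, and an overfit-looking fit only appears if you explicitly pass the training rows back in.

# Forcing predictions on the training rows (instead of the OOB default)
# shows a much better in-sample agreement:
table(predict(rf, newdata = data), data$response)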

answered by Noah