Different results with formula and non-formula for caret training

I noticed that the formula and non-formula methods in caret's train produce different results. Also, the formula method takes almost 10x as long as the non-formula method. Is this expected?

> library(caret)
> library(data.table)
> z <- data.table(c1=sample(1:1000, 1000, replace=T), c2=as.factor(sample(LETTERS, 1000, replace=T)))

# SYSTEM TIME WITH FORMULA METHOD
# -------------------------------

> system.time(r <- train(c1 ~ ., z, method="rf", importance=T))
   user  system elapsed
376.233   9.241  18.190

> r
1000 samples
   1 predictors

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...

Resampling results across tuning parameters:

  mtry  RMSE  Rsquared  RMSE SD  Rsquared SD
  2     295   0.00114   4.94     0.00154
  13    300   0.00113   5.15     0.00151
  25    300   0.00111   5.16     0.00146

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.


# SYSTEM TIME WITH NON-FORMULA METHOD
# -----------------------------------

> system.time(r <- train(z[,2,with=F], z$c1, method="rf", importance=T))
   user  system elapsed
 34.984   2.977   2.708
Warning message:
In randomForest.default(trainX, trainY, mtry = tuneValue$.mtry,  :
  invalid mtry: reset to within valid range
> r
1000 samples
   1 predictors

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...

Resampling results

  RMSE  Rsquared  RMSE SD  Rsquared SD
  297   0.00152   6.67     0.00197

Tuning parameter 'mtry' was held constant at a value of 2
asked Mar 05 '14 by xbsd

1 Answer

You have a categorical predictor with a moderate number of levels. When you use the formula interface, most modeling functions (including train, lm, glm, etc.) internally run model.matrix to process the data set. This creates dummy variables from any factor variables; the non-formula interface does not [1].
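
To see the expansion concretely, here is a minimal sketch reusing the z from the question (assuming all 26 letters appear in the sample, model.matrix yields an intercept plus 25 dummy columns; train then drops the intercept, leaving 25 predictors, consistent with the mtry grid of up to 25 shown above):

> mm <- model.matrix(c1 ~ ., data = z)  # what the formula interface runs internally
> dim(mm)            # expect 1000 x 26: "(Intercept)" plus one column per non-reference level
> colnames(mm)[1:4]  # "(Intercept)" "c2B" "c2C" "c2D"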

When you use dummy variables, only one factor level can be used in any given split. Tree methods handle categorical predictors differently: when dummy variables are not used, random forest sorts the factor levels based on their outcome and finds a 2-way split of the levels [2]. This takes more time.
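
You can reproduce the contrast outside of train by calling randomForest directly. A rough sketch on the same data (the explicit dummy coding via model.matrix is my addition here, not something train exposes this way):

> library(randomForest)
> # factor kept whole: one predictor whose splits can group several levels together
> rf_grouped <- randomForest(x = data.frame(c2 = z$c2), y = z$c1)
> # dummy-coded: 26 binary columns, so each split can only isolate a single level
> x_dummy <- model.matrix(c1 ~ . - 1, data = z)
> rf_dummy <- randomForest(x = x_dummy, y = z$c1)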

Max

[1] I hate to be one of those people who says "in my book I show..." but in this case I will. Fig. 14.2 has a good illustration of this process for CART trees.

[2] God, I'm doing it again. The different representations of factors for trees are discussed in section 14.1, and a comparison between the two approaches for one data set is shown in section 14.7.

answered Sep 30 '22 by topepo