I have been exploring the new recipes package for variable transformations as part of a machine learning pipeline. I opted for this approach, upgrading from caret's preProcess function, because of all the new extensions. But I am finding that the two packages give very different results for the transformed data:
library(caret) # V6.0-79
library(recipes) # V0.1.2
library(MASS) # V7.3-47
# transform variables using recipes
rec_box <- recipe(~ ., data = as.data.frame(state.x77)) %>%
step_BoxCox(., everything()) %>%
prep(., training = as.data.frame(state.x77)) %>%
bake(., as.data.frame(state.x77))
> head(rec_box)
# A tibble: 6 x 8
Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.19 138. 0.647 60171653. 6.89 651. 20. 56.0
2 5.90 185. 0.376 61218586. 5.52 1632. 152. 106.
3 7.70 155. 0.527 66409311. 4.08 1253. 15. 69.4
4 7.65 133. 0.570 66885876. 5.05 609. 65. 56.4
5 9.96 165. 0.0936 71570875. 5.13 1445. 20. 75.5
6 7.84 161. -0.382 73188251. 3.62 1503. 166. 67.7
# transform variables using preProcess
pre_box <- preProcess(x = as.data.frame(state.x77), method = c('BoxCox')) %>%
predict(., newdata = as.data.frame(state.x77))
> head(pre_box)
# A tibble: 6 x 8
Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.19 118. 0.642 2383. 6.83 618. 20. 38.7
2 5.90 157. 0.374 2401. 5.47 1538. 152. 65.7
3 7.70 133. 0.524 2488. 4.05 1183. 15. 46.3
4 7.65 114. 0.566 2496. 5.01 579. 65. 38.9
5 9.96 141. 0.0935 2571. 5.09 1363. 20. 49.7
6 7.84 138. -0.383 2596. 3.60 1418. 166. 45.4
## Subtract the caret::preProcess transformations from the recipes transformations
colMeans(rec_box - pre_box)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
0.000000e+00 2.215800e+01 2.515464e-03 6.803437e+07 2.638715e-02 5.883549e+01 0.000000e+00 1.745788e+01
So it would seem that they do agree on some columns, but differ wildly on others. Is there any reason why these transformations might be so different? Has anyone else found similar discrepancies?
The difference is due to the rounding of the lambdas in the preProcess function: the lambdas are rounded to one decimal place.
Check this example:
library(caret)
library(recipes)
library(MASS)
library(mlbench)
data(Sonar)
df <- Sonar[, -61]  # drop the Class outcome column
Using the preProcess function and setting fudge to 0 (no tolerance for coercing lambdas to 0 or 1):
z2 <- preProcess(x = as.data.frame(df), method = c('BoxCox'), fudge = 0)
and using recipes:
z <- recipe(~ ., data = as.data.frame(df )) %>%
step_BoxCox(., everything()) %>%
prep(., training = as.data.frame(df))
Let's check the lambdas for recipes:
z$steps[[1]]$lambdas
#output
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0.09296796 0.23383117 0.19487939 0.11471259 0.18688851 0.35852835 0.48787887 0.36830343 0.26340880 0.29810673 0.33913896 0.50361765
V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24
0.49178396 0.35997958 0.43900093 0.28981749 0.22843441 0.27016373 0.50573719 0.83436868 1.02366629 1.15194335 1.35062142 1.44484148
V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36
1.51851127 1.61365888 1.47445453 1.44448827 1.22132457 1.00145613 0.66343491 0.61951328 0.53028496 0.45278118 0.39019507 0.37536033
V37 V38 V39 V40 V41 V42 V52 V53 V54 V55 V56 V57
0.28428050 0.23439217 0.29554367 0.47263000 0.34455069 0.44036919 0.15240917 0.30314637 0.28647186 0.16202628 0.27153385 0.17005357
V58 V59 V60
0.15688906 0.28761156 0.06652761
and the lambdas for preProcess:
sapply(z2$bc, function(x) x$lambda)
#output
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34
0.1 0.2 0.2 0.1 0.2 0.4 0.5 0.4 0.3 0.3 0.3 0.5 0.5 0.4 0.4 0.3 0.2 0.3 0.5 0.8 1.0 1.2 1.4 1.4 1.5 1.6 1.5 1.4 1.2 1.0 0.7 0.6 0.5 0.5
V35 V36 V37 V38 V39 V40 V41 V42 V52 V53 V54 V55 V56 V57 V58 V59 V60
0.4 0.4 0.3 0.2 0.3 0.5 0.3 0.4 0.2 0.3 0.3 0.2 0.3 0.2 0.2 0.3 0.1
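As a quick check (a sketch; it assumes both objects keep the same columns in the same order, and the match may not be exact for every column), rounding the recipes lambdas to one decimal place should get you close to the preProcess lambdas:
# compare rounded recipes lambdas with the preProcess lambdas
all.equal(round(z$steps[[1]]$lambdas, 1), sapply(z2$bc, function(x) x$lambda))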
So df$V1^z$steps[[1]]$lambdas[1] is not equal to df$V1^sapply(z2$bc, function(x) x$lambda)[1].
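To see the effect on actual values, you can plug both lambdas into the Box-Cox formula (x^lambda - 1) / lambda for the first column (a sketch; the lambdas in the comments come from the outputs above):
lam_rec <- z$steps[[1]]$lambdas[1]                 # ~0.093 from recipes
lam_pre <- sapply(z2$bc, function(x) x$lambda)[1]  # 0.1 from preProcess
head((df$V1^lam_rec - 1) / lam_rec)                # transformed with the recipes lambda
head((df$V1^lam_pre - 1) / lam_pre)                # slightly different with the rounded lambda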
With the default fudge = 0.2 the difference will be even larger, since lambdas between -0.2 and 0.2 will be coerced to 0, i.e. a log transformation, while lambdas between 0.8 and 1.2 will be coerced to 1, i.e. no transformation.
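Here is a minimal sketch of that coercion rule (illustrative only, not caret's actual implementation):
coerce_lambda <- function(lambda, fudge = 0.2) {
  if (abs(lambda) < fudge)     return(0)  # close to 0 -> log transformation
  if (abs(lambda - 1) < fudge) return(1)  # close to 1 -> no transformation
  lambda                                  # otherwise keep the estimated lambda
}
sapply(c(-0.15, 0.093, 0.5, 0.9, 1.3), coerce_lambda)
# 0.0 0.0 0.5 1.0 1.3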
I would not be too concerned about these differences; both functions will reduce the skewness of the data. Just don't mix them in the same training pipeline.
Also, to get less biased estimates of performance, these transformations should be performed during resampling rather than prior to it, to avoid data leakage.
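For example, with caret you can let train() estimate the transformation inside each resample (a sketch on the Sonar data from above; glm is just a placeholder model):
set.seed(1)
fit <- train(Class ~ ., data = Sonar,
             method = "glm",
             preProcess = "BoxCox",  # re-estimated within each resampling fold
             trControl = trainControl(method = "cv", number = 5))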