
How to use the Box-Cox power transformation in R

Tags:

r

regression

transformation

I need to transform some data into a 'normal shape' and I read that Box-Cox can identify the exponent to use to transform the data.

From what I understand,

    car::boxCoxVariable(y)

is used for response variables in linear models, and

    MASS::boxcox(object)

for a formula or fitted model object. So, because my data are a variable of a dataframe, the only function I found that I could use is:

    car::powerTransform(dataframe$variable, family="bcPower")

Is that correct? Or am I missing something?

The second question is about what to do after I obtain the

    Estimated transformation parameters
    dataframe$variable
             0.6394806

Should I simply multiply the variable by this value? I did so:

    aaa = 0.6394806
    dataframe$variable2 = (dataframe$variable)*aaa

and then I ran the Shapiro-Wilk test for normality, but again my data don't seem to follow a normal distribution:

    shapiro.test(dataframe$variable2)
    data:  dataframe$variable2
    W = 0.97508, p-value < 2.2e-16


asked Nov 30 '15 13:11

dede

2 Answers

According to the Box-Cox transformation formula in the paper Box, George E. P.; Cox, D. R. (1964), "An analysis of transformations", I think mlegge's post might need to be slightly edited. The transformed y should be (y^(lambda)-1)/lambda instead of y^(lambda). (Actually, y^(lambda) is called the Tukey transformation, which is a distinct transformation formula.)
So, the code should be:

    (trans <- bc$x[which.max(bc$y)])
    # [1] 0.4242424

    # re-run with transformation
    mnew <- lm(((y^trans - 1) / trans) ~ x)  # instead of mnew <- lm(y^trans ~ x)
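To make the difference concrete for a plain numeric vector like the one in the question, here is a minimal base-R sketch. The helper name `box_cox` and the sample values are made up for illustration; I believe `car::bcPower()` provides a packaged version of the same formula.

```r
# Hypothetical helper: one-parameter Box-Cox transform of a positive vector.
box_cox <- function(y, lambda) {
  stopifnot(is.numeric(y), all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y      <- c(0.5, 1, 2, 4, 8)   # made-up positive data
lambda <- 0.6394806            # estimate reported in the question

scaled    <- y * lambda          # what the question did: only rescales y
tukey     <- y^lambda            # Tukey power transformation
box_cox_y <- box_cox(y, lambda)  # Box-Cox transformation

# Multiplying by lambda is a linear rescaling, so the shape of the
# distribution (and the Shapiro-Wilk statistic) stays the same.
cor(scaled, y)

# Box-Cox is a linear function of the Tukey power, so both give the same
# shape; they differ only in location and scale.
cor(box_cox_y, tukey)
```

This is why multiplying by the estimated parameter, as in the question, left the Shapiro-Wilk result unchanged: the parameter is an exponent, not a scale factor.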

More information

Correct implementation of Box-Cox transformation formula by boxcox() in R: https://www.r-bloggers.com/on-box-cox-transform-in-regression-models/
A great comparison between Box-Cox transformation and Tukey transformation. http://onlinestatbook.com/2/transformations/box-cox.html
One could also find the Box-Cox transformation formula on Wikipedia: en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation

Please correct me if I misunderstood it.

answered Sep 30 '22 23:09

Sean Yun-Shiuan Chuang

Box and Cox (1964) suggested a family of transformations designed to reduce non-normality of the errors in a linear model. It turns out that in doing this, it often reduces non-linearity as well.

Here is a nice summary of the original work and all the work that's been done since: http://www.ime.usp.br/~abe/lista/pdfm9cJKUmFZp.pdf

You will notice, however, that the log-likelihood function governing the selection of the lambda power transform depends on the residual sum of squares of an underlying model (no LaTeX on SO -- see the reference), so lambda cannot be estimated without specifying a model.
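Since the formula cannot be typeset here, the dependence on the residual sum of squares can be sketched in code instead. This is a rough base-R illustration, not the internals of `MASS::boxcox()`: for each candidate lambda it transforms y, refits the model, and evaluates the profile log-likelihood -n/2 * log(RSS/n) + (lambda - 1) * sum(log(y)), up to an additive constant. The simulated data and the helper names `box_cox` and `profile_loglik` are assumptions made for this example.

```r
# Hypothetical helper: one-parameter Box-Cox transform (y must be positive).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda (up to an additive constant) for the
# linear model z ~ x, where z is the Box-Cox transform of y.
profile_loglik <- function(lambda, y, x) {
  n   <- length(y)
  z   <- box_cox(y, lambda)
  rss <- sum(residuals(lm(z ~ x))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Made-up data: sqrt(y) is linear in x, so a lambda near 0.5 is expected.
set.seed(42)
x <- runif(100, 1, 5)
y <- (2 + x + rnorm(100, sd = 0.05))^2

lambdas    <- seq(-2, 2, by = 0.01)
loglik     <- vapply(lambdas, profile_loglik, numeric(1), y = y, x = x)
lambda_hat <- lambdas[which.max(loglik)]
lambda_hat
```

For a single variable with no predictors, an intercept-only formula such as `y ~ 1` can play the role of the model.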

A typical application is as follows:

    library(MASS)

    # generate some data
    set.seed(1)
    n <- 100
    x <- runif(n, 1, 5)
    y <- x^3 + rnorm(n)

    # run a linear model
    m <- lm(y ~ x)

    # run the box-cox transformation
    bc <- boxcox(y ~ x)

[Figure: profile log-likelihood of lambda with its 95% confidence interval, as plotted by boxcox()]

    (lambda <- bc$x[which.max(bc$y)])
    # [1] 0.4242424

    powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") {

      boxcoxTrans <- function(x, lam1, lam2 = NULL) {
        # if we set lambda2 to zero, it becomes the one parameter transformation
        lam2 <- if (is.null(lam2)) 0 else lam2

        if (lam1 == 0) {
          log(x + lam2)
        } else {
          (((x + lam2)^lam1) - 1) / lam1
        }
      }

      switch(method,
             boxcox = boxcoxTrans(y, lambda1, lambda2),
             tukey = y^lambda1)
    }

    # re-run with transformation
    mnew <- lm(powerTransform(y, lambda) ~ x)

    # QQ-plot of the residuals before and after the transformation
    op <- par(pty = "s", mfrow = c(1, 2))
    qqnorm(m$residuals); qqline(m$residuals)
    qqnorm(mnew$residuals); qqline(mnew$residuals)
    par(op)

[Figure: normal Q-Q plots of the residuals of m (left) and mnew (right)]

As you can see this is no magic bullet -- only some data can be effectively transformed (usually a lambda less than -2 or greater than 2 is a sign you should not be using the method). As with any statistical method, use with caution before implementing.
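Because the method targets the errors of the model rather than the raw response, a normality check such as the Shapiro-Wilk test from the question is more informative when it is run on the residuals of the refitted model. A minimal sketch, using simulated data and a lambda of 0.5 that is known by construction rather than estimated; the helper `box_cox` is hypothetical:

```r
# Hypothetical helper: one-parameter Box-Cox transform (y must be positive).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Made-up data in which sqrt(y) is linear in x with normal errors.
set.seed(1)
x <- runif(200, 1, 5)
y <- (2 + x + rnorm(200, sd = 0.3))^2

m_raw <- lm(y ~ x)                 # untransformed response
m_bc  <- lm(box_cox(y, 0.5) ~ x)   # lambda = 0.5 is known here by design

# Test the residuals, not the response itself.
sw_raw <- shapiro.test(residuals(m_raw))
sw_bc  <- shapiro.test(residuals(m_bc))
c(raw = unname(sw_raw$statistic), box_cox = unname(sw_bc$statistic))
```

With large samples the test flags even small departures from normality, so a Q-Q plot of the residuals is a useful companion to the p-value.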

To use the two parameter Box-Cox transformation, use the geoR package to find the lambdas:

    library("geoR")

    # boxcoxfit() takes the data to transform first, then the covariates
    bc2 <- boxcoxfit(y, x, lambda2 = TRUE)

    lambda1 <- bc2$lambda[1]
    lambda2 <- bc2$lambda[2]
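The shift lambda2 matters mainly when the response contains zeros or negative values, where the one-parameter form is undefined. A small base-R sketch of applying a given pair of lambdas; the helper name `box_cox2` and the numbers are illustrative assumptions, not output from `boxcoxfit()`:

```r
# Hypothetical helper: two-parameter Box-Cox transform.
# Requires y + lambda2 > 0 for every observation.
box_cox2 <- function(y, lambda1, lambda2 = 0) {
  shifted <- y + lambda2
  if (any(shifted <= 0)) stop("y + lambda2 must be strictly positive")
  if (abs(lambda1) < 1e-8) log(shifted) else (shifted^lambda1 - 1) / lambda1
}

y <- c(0, 0.5, 1, 3, 10)   # made-up data containing a zero

# Works: every shifted value is positive.
box_cox2(y, lambda1 = 0.5, lambda2 = 1)

# Stops with an error: the zero is not allowed without a shift.
try(box_cox2(y, lambda1 = 0.5))
```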

EDITS: Conflation of Tukey and Box-Cox implementation as pointed out by @Yui-Shiuan fixed.

answered Sep 30 '22 21:09

mlegge
