
Linear regression in R (normal and logarithmic data)

I want to carry out a linear regression in R for data in a normal and in a double logarithmic plot.

For normal data the dataset might be the following:

lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
plot(lin$x, lin$y)

There I want to calculate and draw the regression line using only the data points 2, 3 and 4.

For double logarithmic data the dataset might be the following:

data = data.frame(
    x=c(1:15),
    y=c(
        1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
        0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020
      )
    )
plot(data$x, data$y, log = "xy")

Here I want to draw regression lines for the subsets 1:7 and 8:15.

How can I calculate the slope and the y-intercept, as well as goodness-of-fit parameters (R^2, p-value)?

How is it done for normal and for logarithmic data?

Thanks for your help,

Sven

asked Jun 08 '11 by R_User


2 Answers

In R, linear least squares models are fitted via the lm() function. Using the formula interface we can use the subset argument to select the data points used to fit the actual model, for example:

lin <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)

giving:

R> linm

Call:
lm(formula = y ~ x, data = lin, subset = 2:4)

Coefficients:
(Intercept)            x  
     -1.633        1.500  

R> fitted(linm)
         2          3          4 
-0.1333333  1.3666667  2.8666667
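To answer the R^2 and p-value part of the question: summary() on the fitted lm object gives you all of these at once. A minimal sketch using the model above:

```r
lin <- data.frame(x = 0:6, y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
linm <- lm(y ~ x, data = lin, subset = 2:4)

s <- summary(linm)
coef(linm)        # intercept (the y-offset) and slope
s$r.squared       # R^2 of the fit
s$coefficients    # estimates with std. errors, t-statistics and p-values
```

With only 3 points and 2 parameters there is a single residual degree of freedom here, so treat the p-values with due caution.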

As for the double log, you have two choices: i) estimate two separate models as we did above, or ii) estimate a single model via ANCOVA. In either case the log transformation is done in the formula using log().

Via two separate models (here I call your data frame dat, to avoid masking the built-in data()):

logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
logm2 <- lm(log(y) ~ log(x), data = dat, subset = 8:15)

Or via ANCOVA, where we need an indicator variable

dat <- transform(dat, ind = factor(1:15 <= 7))
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)

You might ask whether these two approaches are equivalent. They are, and we can show this via the model coefficients.

R> coef(logm1)
  (Intercept)        log(x) 
-0.0001487042 -0.4305802355 
R> coef(logm2)
(Intercept)      log(x) 
  0.1428293  -1.4966954

So the two slopes are -0.4306 and -1.4967 for the separate models. The coefficients for the ANCOVA model are:

R> coef(logm3)
   (Intercept)         log(x)        indTRUE log(x):indTRUE 
     0.1428293     -1.4966954     -0.1429780      1.0661152

How do we reconcile the two? Because of the way I set up ind, logm3 is parametrised so that its intercept and log(x) coefficient directly match those of logm2. To get the values equivalent to the coefficients of logm1, we need a little manipulation, first for the intercept:

R> coefs <- coef(logm3)
R> coefs[1] + coefs[3]
  (Intercept) 
-0.0001487042

where the coefficient for indTRUE is the difference between the intercept for group 1 and that for group 2. And for the slope:

R> coefs[2] + coefs[4]
    log(x) 
-0.4305802

which is the same as we got for logm1 and is based on the slope for group 2 (coefs[2]) modified by the difference in slope for group 1 (coefs[4]).
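As a quick numerical check of that equivalence (a sketch assuming the data frame from the question, here called dat):

```r
dat <- data.frame(
    x = 1:15,
    y = c(1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
          0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020))

logm1 <- lm(log(y) ~ log(x), data = dat, subset = 1:7)
dat <- transform(dat, ind = factor(1:15 <= 7))
logm3 <- lm(log(y) ~ log(x) * ind, data = dat)
coefs <- coef(logm3)

## group-1 intercept and slope recovered from the ANCOVA fit;
## should agree with coef(logm1) up to numerical tolerance
all.equal(unname(coef(logm1)),
          unname(c(coefs[1] + coefs[3], coefs[2] + coefs[4])))
```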

As for plotting, an easy way is via abline() for simple models. E.g. for the normal data example:

plot(y ~ x, data = lin)
abline(linm)

For the log data we might need to be a bit more creative, and the general solution here is to predict over the range of data and plot the predictions:

pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x, 1), 
                                     by = 0.1)))
pdat <- transform(pdat, yhat = c(predict(logm1, pdat[1:70, , drop = FALSE]), 
                                 predict(logm2, pdat[71:141, , drop = FALSE])))

which we can plot on the original scale by exponentiating yhat:

plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:141, col = "blue")

or on the log scale:

plot(log(y) ~ log(x), data = dat)
lines(yhat ~ log(x), data = pdat, subset = 1:70, col = "red")
lines(yhat ~ log(x), data = pdat, subset = 71:141, col = "blue")


This general solution works well for the more complex ANCOVA model too. Here I create a new pdat as before and add in an indicator:

pdat <- with(dat, data.frame(x = seq(from = head(x, 1), to = tail(x, 1), 
                                     by = 0.1),
                             ind = factor(rep(c(TRUE, FALSE), times = c(70, 71)))))
pdat <- transform(pdat, yhat = predict(logm3, pdat))

Notice how we get all the predictions we want from the single call to predict() because of the use of ANCOVA to fit logm3. We can now plot as before:

plot(y ~ x, data = dat)
lines(exp(yhat) ~ x, data = pdat, subset = 1:70, col = "red")
lines(exp(yhat) ~ x, data = pdat, subset = 71:141, col = "blue")
answered Sep 27 '22 by Gavin Simpson


#Split the data into two groups
data1 <- data[1:7, ]
data2 <- data[8:15, ]

#Perform the regression
model1 <- lm(log(y) ~ log(x), data1)
model2 <- lm(log(y) ~ log(x), data2)
summary(model1)
summary(model2)

#Plot it
with(data, plot(x, y, log="xy"))
lines(1:7, exp(predict(model1, data.frame(x = 1:7))))
lines(8:15, exp(predict(model2, data.frame(x = 8:15))))

In general, splitting the data into different groups and running different models on different subsets is unusual, and probably bad form. You may want to consider adding a grouping variable

data$group <- factor(rep(letters[1:2], times = 7:8))

and running some sort of model on the whole dataset, e.g.,

model_all <- lm(log(y) ~ log(x) * group, data)
summary(model_all)
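If you go the single-model route, the per-group slopes can still be read off the coefficients. A sketch: group "a" is the reference level, so its slope is the bare log(x) term, and group "b" adds the log(x):groupb interaction.

```r
data <- data.frame(
    x = 1:15,
    y = c(1.000, 0.742, 0.623, 0.550, 0.500, 0.462, 0.433,
          0.051, 0.043, 0.037, 0.032, 0.028, 0.025, 0.022, 0.020))
data$group <- factor(rep(letters[1:2], times = 7:8))
model_all <- lm(log(y) ~ log(x) * group, data)

cf <- coef(model_all)
slope_a <- unname(cf["log(x)"])                        # slope for group "a" (rows 1:7)
slope_b <- unname(cf["log(x)"] + cf["log(x):groupb"])  # slope for group "b" (rows 8:15)
```

These match the slopes from the two separate fits in the other answer (about -0.431 and -1.497).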
answered Sep 27 '22 by Richie Cotton