Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better fits for a linear model

I am fitting some lines and I feel like I am telling R exactly how to fit them, but I feel like there is something (some factor or effect) I am unaware of that is preventing a good fit.

My experimental unit is "plot" as in field plot, which I am sorry is confusing.

The data can be found: https://www.dropbox.com/s/a0tplyvs8lxu1d0/rootmeansv2.csv . with

df$plot.f<-as.factor(df$plot)
dfG<-groupedData(mass ~ year|plot.f, data=df)
dfG30<-dfG[dfG$depth == 30,]

Simply, I have mass over time and I fit it to each experimental unit with the model:

fit <- lme(mass ~ year , random = ~ 1 | plot, data = df)

and with plot (augPred(fit)) I get these fits for each experimental unit ("plot"):

What do I need to do to let the slope vary more between experimental units? I am not interested in this from a statistical perspective, but from a predictive perspective -- so anything in the model can be manipulated to get those lines to move.

like image 778
Nazer Avatar asked Dec 05 '22 06:12

Nazer


1 Answers

This answer is going to be a bit encyclopedic, because there are several important points about your question. tl;dr adding year*plot as a random effect is the first step, but the fit is actually a bit problematic, although it doesn't appear so at first: centering the year variable takes care of the problem, but log-transforming the data might be an even better idea.

UPDATE: OP was doing an analysis on subset(df,Depth==30). I accidentally did the analysis on the full data set, which may have made things harder -- the heteroscedasticity documented below probably isn't as bad for a subset of the data -- but I'm going to leave it as is for pedagogical purposes (and out of laziness).

Get the data:

df <- read.csv("rootmeansv2.csv")
library(nlme)
gdf <- groupedData( mass ~ year | plot, data=df)

Adding the year-by-plot interaction to the model as a random effect:

fit0 <- lme(mass ~ year , random = ~ year | plot, data = gdf)

However, if we look at the results we see that this doesn't help very much at all -- the year term (among-plot variance in the slope) is tiny.

##             Variance     StdDev       Corr  
## (Intercept) 9.522880e-12 3.085916e-06 (Intr)
## year        7.110616e-08 2.666574e-04 0.32  
## Residual    3.885373e-01 6.233276e-01       

This does sometimes happen (a singular fit), but otherwise the lme summary doesn't indicate in any other way that there might be something wrong. However, if we try to get confidence intervals we see there might be trouble:

 intervals(fit0)
 ##   Error in intervals.lme(fit0) : 
 ##   cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance

The other way to double-check is to fit the same model in lme4.

library(lme4)
VarCorr(fit1 <- lmer(mass ~ year +(year | plot), data = gdf))

##  Groups   Name        Std.Dev.   Corr  
##  plot     (Intercept) 0.66159674       
##           year        0.00059471 -1.000
##  Residual             0.62324403       
## Warning messages:
## 1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
##   Model failed to converge with max|grad| = 0.132739 (tol = 0.002, component 1)
## 2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
##   Model failed to converge: degenerate  Hessian with 1 negative eigenvalues

We get apparently quite different answers, and warnings about convergence (this is in the development version, 1.1-7, which does not suffer from the false-positive warnings reported widely for 1.1-6). It looks like lme4's fit is slightly better:

c(logLik(fit1)-logLik(fit0))
## 0.1775161

One of the potential data problems for complex fits such as mixed models is poor centering and scaling of predictor variables. In this case, because year is a continuous predictor that starts at 2008, the intercept and slopes are extremely highly correlated (the intercept is the predicted value in year 0!). We can fix the problem in this case by centering -- it would also be reasonable to subtract the minimum (i.e. start year at 0 rather than 2008)

c. <- function(x) scale(x,center=TRUE,scale=FALSE)
VarCorr(fit2 <- update(fit1,.~ c.(year) +(c.(year) | plot)))

##  Groups   Name        Std.Dev. Corr 
##  plot     (Intercept) 0.53798       
##           c.(year)    0.10515  1.000
##  Residual             0.59634       

We get more reasonable answers (and no warnings), although we do still have perfectly (now positively) correlated intercepts and slopes -- this just says the model is slightly overfitting the data, but there's not much we can do about this (we could force the correlation to zero by fitting the model ~c.(year)+(c.(year)||plot)), but this has its own problems).

Now try it in lme:

VarCorr(fit3 <- update(fit0,
                       fixed.=~c.(year),
                       random=~c.(year)|plot,
         control=lmeControl(opt="optim")))
## plot = pdLogChol(c.(year)) 
##             Variance   StdDev    Corr  
## (Intercept) 0.28899909 0.5375864 (Intr)
## c.(year)    0.01122497 0.1059479 0.991 
## Residual    0.35559015 0.5963138       

Results are similar (although correlation is only 0.991 instead of 1.0): the difference in log-likelihoods is actually slightly larger, but still small. (The fit is still somewhat problematic -- I had to change optimizers as shown in the lmeControl argument to get here.)

Things now look better:

plot(augPred(fit3))

enter image description here

However, the residual vs fitted plot shows you a problem you should worry about:

plot(fit3)

enter image description here

The funnel shape shows you have a heteroscedasticity problem. The scale-location plot shows it even more clearly:

plot(fit3,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))

enter image description here

The most obvious fix is to log-transform the data:

df$logmass <- log10(df$mass)  ## log10() for interpretability
gdfL <- groupedData( logmass ~ year | plot, data=df)
VarCorr(fitL1 <- lme(logmass ~ c.(year) , random = ~ c.(year) | plot,
             data = gdfL))
## plot = pdLogChol(c.(year)) 
##             Variance     StdDev     Corr  
## (Intercept) 0.1842905872 0.42929080 (Intr)
## c.(year)    0.0009702893 0.03114947 -0.607
## Residual    0.1052946948 0.32449144       

The among-year variation is small again, but this time it's probably because it's not needed. There are no warnings from lmer when we fit the equivalent model, and we get identical results; fit is non-singular (no zero variances or perfect correlations); and intervals(fitL1) works.

The predictions look reasonable:

plot(augPred(fitL1))

enter image description here

So does the diagnostic plot:

plot(fitL1)

enter image description here

Although there does appear to be something a bit funny about 2008 (the axes are flipped because lme prefers to plot boxplots horizontally -- factor(year) tells lme to use a boxplot instead of a scatterplot).

plot(fitL1,factor(year)~resid(.))

enter image description here

like image 89
Ben Bolker Avatar answered Dec 12 '22 06:12

Ben Bolker