
How to fit an autoregressive Poisson mixed model (count time series) in R?

My task is to assess how various environmental variables affect annual population fluctuations. For this, I need to fit a Poisson autoregressive model to time series of counts:

N_{i,j} ~ Poisson(\mu_{i,j})
log(\mu_{i,j}) = log(\mu_{i,j-1}) + \alpha + \beta x_{i,j} + \gamma_j,   with \gamma_j ~ Normal(0, \sigma^2)

where N_{i,j} is the count of observed individuals at site i in year j and x_{i,j} is the environmental variable at site i in year j — these are the input data. The rest are parameters: \mu_{i,j} is the expected number of individuals at site i in year j, and \gamma_j is the random effect for year j.

Is it possible to fit such a model in R? I want to avoid fitting it in a Bayesian framework since the computation takes way too long (I have to process 5000 such models). I tried to transform the model for GLM, but once I add the random effect (\gamma_j) this is no longer possible.

asked Mar 19 '14 by Tomas


2 Answers

First, let's create some simulated data (all the ad hoc functions are at the end of the answer):

set.seed(12345) # updated to T=20 and L=40 for comparative purposes.

T = 20 # number of years
L = 40  # number of sites
N0 = 100 # average initial pop (to simulate data)
sd_env = 0.8 # to simulate the env (assumed mean 0)
env  = matrix(rnorm(T*L, mean=0, sd=sd_env), nrow=T, ncol=L)

# 'real' parameters
alpha  = 0.1
beta   = 0.05
sd     = 0.4
gamma  = rnorm(T-1, mean=0, sd=sd)
mu_ini = log(rpois(n=L, lambda=N0)) #initial means

par_real = list(alpha=alpha, beta=beta, gamma=gamma, 
               sd=sd, mu_ini=mu_ini)

mu = dynamics(par=par_real, x=env, T=T, L=L)

# observed abundances
n = matrix(rpois(length(mu), lambda=mu), nrow=T, ncol=L)
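
Before fitting, a quick sanity check of the simulated series (a minimal sketch, assuming the objects created above):

dim(n)    # T=20 years by L=40 sites
matplot(n, type="l", lty=1, xlab="year", ylab="count")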

Now, for a given set of parameters, we can simulate the expected number of individuals at each site and year. And since we have the observed number of individuals, we can write the likelihood function for the observations (Poisson distributed) and add a penalty for the annual deviates in the growth rate (so they are normally distributed). For this, the function dynamics will simulate the model and the function .getLogLike will calculate the objective function. Then we need to optimize the objective function. The parameters to estimate are alpha, beta, the annual deviates (gamma), the initial expected number of individuals (mu_ini), and maybe sigma.
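
Concretely, the objective minimized below is the negative penalized log-likelihood. A minimal sketch mirroring .getLogLike (defined at the end of the answer); the constant log(obs!) term of the Poisson likelihood is dropped since it does not depend on the parameters:

negLogLike = function(obs, mu, gamma, sd) {
  # Poisson log-likelihood (constant term dropped) minus a
  # normal penalty on the annual deviates
  -(sum(obs*log(mu) - mu) - sum(gamma^2/(2*sd^2)))
}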

For the first try, we can provide 0 as the initial guess for all parameters, except for the initial expected numbers, for which we can use the initial observed abundances (that's the MLE anyway).

fit0 = fitModel0(obs=n, env=env, T=T, L=L)

Optimal parameters: 
      alpha        beta      gamma1      gamma2      gamma3 
 0.28018842  0.05464360 -0.12904373 -0.15795001 -0.04502903 
     gamma4      gamma5      gamma6      gamma7      gamma8 
 0.05045117  0.08435066  0.28864816  0.24111786 -0.80569709 
     gamma9     gamma10     gamma11     gamma12     gamma13 
 0.22786951  0.10326086 -0.50096088 -0.08880594 -0.33392310 
    gamma14     gamma15     gamma16     gamma17     gamma18 
 0.22664634 -0.47028311  0.11782381 -0.16328820  0.04208037 
    gamma19     mu_ini1     mu_ini2     mu_ini3     mu_ini4 
 0.17648808  4.14267523  4.19187205  4.05573114  3.90406443 
    mu_ini5     mu_ini6     mu_ini7     mu_ini8     mu_ini9 
 4.08975038  4.17185883  4.03679049  4.23091760  4.04940333 
   mu_ini10    mu_ini11    mu_ini12    mu_ini13    mu_ini14 
 4.19355333  4.05543081  4.15598515  4.18266682  4.09095730 
   mu_ini15    mu_ini16    mu_ini17    mu_ini18    mu_ini19 
 4.17922360  3.87211968  4.04509178  4.19385641  3.98403521 
   mu_ini20    mu_ini21    mu_ini22    mu_ini23    mu_ini24 
 4.08531659  4.19294203  4.29891769  4.21025211  4.16297457 
   mu_ini25    mu_ini26    mu_ini27    mu_ini28    mu_ini29 
 4.19265543  4.28925869  4.10752810  4.10957212  4.14953247 
   mu_ini30    mu_ini31    mu_ini32    mu_ini33    mu_ini34 
 4.09690570  4.34234547  4.18169575  4.01663339  4.32713905 
   mu_ini35    mu_ini36    mu_ini37    mu_ini38    mu_ini39 
 4.08121891  3.98256819  4.08658375  4.05942834  4.06988174 
   mu_ini40 
 4.05655031

This works, but normally some parameters are correlated and harder to identify from the data, so a sequential approach can be used (see Bolker et al. 2013 for more information). In this approach, we progressively increase the number of parameters, improving the initial guess for the optimization at each phase of the calibration. For this example, the first phase estimates only alpha and beta, using as guesses the coefficients of a linear model of the growth rate against the environment. Then, the second phase starts from the estimates of the first optimization and adds the annual deviates (gamma) as parameters. Finally, the third phase starts from the estimates of the second optimization and adds the initial expected values as parameters. We add the initial expected values last because the initial observed values are already very close to them and a good starting guess, whereas we have no clear idea of the values of the remaining parameters.
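
For reference, the phase-1 starting values come from regressing the observed log growth rates on the environment, which is what fitModel does internally (a sketch using the n and env objects simulated above):

r = log(n[-1,]/n[-T,])     # observed log growth rates
guess = lm(as.numeric(r) ~ as.numeric(env[-T,]))
coef(guess)                # rough starting values for alpha and beta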

fit  = fitModel(obs=n, env=env, T=T, L=L)


Phase 1: alpha and beta only
Optimal parameters: 
     alpha       beta 
0.18172961 0.06323379 

neg-LogLikelihood:  -5023687 
Phase 2: alpha, beta and gamma
Optimal parameters: 
      alpha        beta      gamma1      gamma2      gamma3 
 0.20519928  0.06238850 -0.35908716 -0.21453015 -0.05661066 
     gamma4      gamma5      gamma6      gamma7      gamma8 
 0.18963170  0.17800563  0.34303170  0.28960181 -0.72374927 
     gamma9     gamma10     gamma11     gamma12     gamma13 
 0.28464203  0.16900331 -0.40719047 -0.01292168 -0.25535610 
    gamma14     gamma15     gamma16     gamma17     gamma18 
 0.28806711 -0.38924648  0.19224527 -0.07875934  0.10880154 
    gamma19 
 0.24518786 

neg-LogLikelihood:  -5041345 
Phase 3: alpha, beta, gamma and mu_ini
Optimal parameters: 
        alpha          beta        gamma1        gamma2 
 0.1962334008  0.0545361273 -0.4298024242 -0.1984379386 
       gamma3        gamma4        gamma5        gamma6 
 0.0240318556  0.1909639571  0.1116636126  0.3465693397 
       gamma7        gamma8        gamma9       gamma10 
 0.3478695629 -0.7500599493  0.3600551021  0.1361405398 
      gamma11       gamma12       gamma13       gamma14 
-0.3874453347 -0.0005839263 -0.2305008546  0.2819608670 
      gamma15       gamma16       gamma17       gamma18 
-0.3615273177  0.1792020332 -0.0685681922  0.1203006457 
      gamma19       mu_ini1       mu_ini2       mu_ini3 
 0.2506129351  4.6639314468  4.7205977429  4.5802529032 
      mu_ini4       mu_ini5       mu_ini6       mu_ini7 
 4.4293994068  4.6182382472  4.7039110982  4.5668031666 
      mu_ini8       mu_ini9      mu_ini10      mu_ini11 
 4.7610910879  4.5844180026  4.7226353021  4.5823048717 
     mu_ini12      mu_ini13      mu_ini14      mu_ini15 
 4.6814189824  4.7130039559  4.6135420745  4.7100006841 
     mu_ini16      mu_ini17      mu_ini18      mu_ini19 
 4.4080115751  4.5758092977  4.7209394881  4.5150790425 
     mu_ini20      mu_ini21      mu_ini22      mu_ini23 
 4.6171948847  4.7141188899  4.8303375556  4.7392110431 
     mu_ini24      mu_ini25      mu_ini26      mu_ini27 
 4.6893526309  4.7237687961  4.8234804183  4.6333012324 
     mu_ini28      mu_ini29      mu_ini30      mu_ini31 
 4.6392335265  4.6817044754  4.6260620666  4.8713345071 
     mu_ini32      mu_ini33      mu_ini34      mu_ini35 
 4.7107116580  4.5471434622  4.8540773708  4.6129553933 
     mu_ini36      mu_ini37      mu_ini38      mu_ini39 
 4.5134108799  4.6231016082  4.5823454113  4.5969785420 
     mu_ini40 
 4.5835763300 

neg-LogLikelihood:  -5047251 

Comparing both calibrations of the model, we can see that the second one provides a lower value for the objective function. Also, the correlation between the 'real' annual deviates and the estimated ones is higher for the second calibration:

cor(gamma, fit0$par$gamma)
[1] 0.8708379
cor(gamma, fit$par$gamma)
[1] 0.9871758

And looking at the outputs, we can see some problems with the estimation of the initial expected values in the first calibration (underestimated for all sites); with real data, a multi-phase calibration normally works much better:

par(mfrow=c(3,2), mar=c(3,5,1,1), oma=c(1,1,1,1))
for(i in 1:4) {
  ylim=c(0, 1.1*log(max(fit$fitted, n)))
  plot(log(fit$fitted[,i]), type="l", col="blue", ylim=ylim,
       ylab="mu (log)")
  lines(log(fit0$fitted[,i]), col="green")
  points(log(mu[,i]), col="red")
  mtext(paste("Site ", i), 3, adj=0.05, line=-2)
  if(i==3) {
    mtext(c("observed", "fitModel0", "fitModel1"), 1, adj=0.95, 
          line=-1.5:-3.5, col=c("red", "green", "blue"), cex=0.8)
  }
}

mus = rbind(mu_ini, fit$par$mu_ini, fit0$par$mu_ini)
barplot(mus, beside=TRUE, col=c("red", "blue", "green"),
        ylab="Initial expected population",
        xlab="Sites", border=NA)

gam = rbind(gamma, fit$par$gamma, fit0$par$gamma)
barplot(gam, beside=TRUE, col=c("red", "blue", "green"),
        ylab="Annual deviates", border=NA)

[Figure: fitted vs. observed log-abundance for four example sites, plus barplots comparing real and estimated initial expected populations and annual deviates]

Finally,

system.time(fitModel(obs=n, env=env, T=T, L=L))

   user  system elapsed 
   9.85    0.00    9.85 

This is around four times slower than the solution proposed by @Thierry using INLA (timings from summary(model)):

Time used:
 Pre-processing    Running inla Post-processing           Total 
         0.1070          2.3131          0.0460          2.4661

However, after byte compiling my functions, we get:

   user  system elapsed 
   7.53    0.00    7.53

That's 24% faster, and now only about 3 times slower than the INLA method. Still, I think this is reasonable even for thousands of experiments (my own model takes 5 days for a single optimization, so maybe I have a bias here), and since we're using simulated data we can compare the reliability of the parameter estimates in addition to the computation time.

# The functions -----------------------------------------------------------

require(compiler)

# Simulate the expected abundances: log(mu[t+1,]) = log(mu[t,]) + alpha + beta*x[t,] + gamma[t]
dynamics = function(par, obs, x, T, L) {

  alpha  = par$alpha
  beta   = par$beta
  # missing parameters get defaults: zero annual deviates, observed initial abundances
  gamma  = if(!is.null(par$gamma)) par$gamma else rep(0, T-1)
  mu_ini = if(!is.null(par$mu_ini)) exp(par$mu_ini) else obs[1,]

  mu = matrix(nrow=T, ncol=L)

  mu[1,] = mu_ini

  for(t in seq_len(T-1)) {
    log_mu_new = log(mu[t,]) + alpha + beta*x[t,] + gamma[t]
    mu[t+1, ] = exp(log_mu_new)
  }
  return(mu)
}

dynamics = cmpfun(dynamics)

# Rebuild the named parameter list from the flat vector used by optim
reListPars = function(par) {
  out = list()
  out$alpha = as.numeric(par["alpha"])
  out$beta  = as.numeric(par["beta"])
  if(!is.na(par["sd"])) out$sd = as.numeric(par["sd"])
  gammas =  as.numeric(par[grepl("gamma", names(par))])
  if(length(gammas)>0) out$gamma = gammas
  mu_inis = as.numeric(par[grepl("mu_ini", names(par))])
  if(length(mu_inis)>0) out$mu_ini = mu_inis
  return(out)
}

reListPars = cmpfun(reListPars)

# Negative penalized log-likelihood: Poisson log-likelihood (constant
# log(obs!) term dropped) plus a normal penalty on the annual deviates
.getLogLike = function(par, obs, env, T, L) {
  par = reListPars(par)
  if(is.null(par$sd)) {
    par$sd = if(!is.null(par$gamma)) sd(par$gamma)+0.01 else 1
  }
  mu = dynamics(par=par, obs=obs, x=env, T=T, L=L)
  logLike = sum(obs*log(mu) - mu) - sum(par$gamma^2/(2*par$sd^2))
  return(-logLike)
}

.getLogLike = cmpfun(.getLogLike)

# Heuristic upper bounds for the L-BFGS-B search
.getUpper = function(par) {
  par$alpha = 10*par$alpha + 1
  par$beta  = 10*abs(par$beta) + 1
  if(!is.null(par$gamma)) {
    if(!is.null(par$sd)) sd = par$sd else sd=sd(par$gamma)
    if(sd==0) sd=1
    par$gamma = rep(qnorm(0.999, sd=sd), length(par$gamma))
  }
  if(!is.null(par$mu_ini)) par$mu_ini = 5*par$mu_ini
  if(!is.null(par$sd)) par$sd = 10*par$sd
  par = unlist(par)
  return(par)
}

.getUpper = cmpfun(.getUpper)

# Heuristic lower bounds for the L-BFGS-B search
.getLower = function(par) {
  par$alpha = 0 # alpha>0?
  par$beta  = -10*abs(par$beta) - 1
  if(!is.null(par$gamma)) {
    if(!is.null(par$sd)) sd = par$sd else sd=sd(par$gamma)
    if(sd==0) sd=1
    par$gamma = rep(qnorm(1-0.999, sd=sd), length(par$gamma))
  }
  if(!is.null(par$mu_ini)) par$mu_ini = 0.2*par$mu_ini
  if(!is.null(par$sd)) par$sd = 0.0001*par$sd
  par = unlist(par)
  return(par)
}

.getLower = cmpfun(.getLower)

fitModel = function(obs, env, T, L) {

  r = log(obs[-1,]/obs[-T,])
  guess = data.frame(rate=as.numeric(r), env=as.numeric(env[-T,]))
  mod1 = lm(rate ~ env, data=guess)

  output = list()
  output$par = NULL

  # Phase 1: alpha and beta only
  cat("Phase 1: alpha and beta only\n")

  par = list()
  par$alpha = as.numeric(coef(mod1)[1])
  par$beta  = as.numeric(coef(mod1)[2])

  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL, 
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par), 
                         upp=.getUpper(par))

  output$phase1 = opt

  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")

  # Phase 2: alpha, beta and gamma
  cat("Phase 2: alpha, beta and gamma\n")

  optpar = reListPars(opt$par)
  par$alpha = optpar$alpha
  par$beta  = optpar$beta
  par$gamma = rep(0, T-1)

  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL, 
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par), 
                           upp=.getUpper(par))

  output$phase2 = opt

  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")

  # Phase 3: alpha, beta, gamma and mu_ini
  cat("Phase 3: alpha, beta, gamma and mu_ini\n")

  optpar = reListPars(opt$par)
  par$alpha = optpar$alpha
  par$beta  = optpar$beta
  par$gamma = optpar$gamma
  par$mu_ini = log(obs[1,])

  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL, 
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par),
              control=list(maxit=1000))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par), 
                         upp=.getUpper(par))

  output$phase3 = opt

  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")

  output$par = reListPars(opt$par)

  output$fitted = dynamics(par=output$par, obs=obs, x=env, T=T, L=L)
  output$observed = obs
  output$env = env

  return(output)

}

fitModel = cmpfun(fitModel)

fitModel0 = function(obs, env, T, L) {

  output = list()
  output$par = NULL

  par = list()
  par$alpha = 0
  par$beta  = 0
  par$gamma = rep(0, T-1)
  par$mu_ini = log(obs[1,])

  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL, 
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par), 
                         upp=.getUpper(par))

  output$phase1 = opt

  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")

  output$par = reListPars(opt$par)

  output$fitted = dynamics(par=output$par, obs=obs, x=env, T=T, L=L)
  output$observed = obs
  output$env = env

  return(output)

}

fitModel0 = cmpfun(fitModel0)
answered by Ricardo Oliveros-Ramos


Have a look at the INLA package (http://www.r-inla.org).

It is Bayesian, but uses integrated nested Laplace approximations, which makes the speed of a model comparable to that of frequentist models (glm, glmm).
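
Note that INLA is not distributed via CRAN; at the time of writing it can be installed from the project's own repository (check the site above in case the URL has changed):

install.packages("INLA",
  repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"))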

Starting from the mu and env objects from Ricardo Oliveros-Ramos's answer, with L = 40. First, prepare the dataset:

dataset <- data.frame(
  count = rpois(length(mu), lambda = mu),
  year = rep(seq_len(T), L),
  site = rep(seq_len(L), each = T),
  env = as.vector(env)
)
library(reshape2)
# reshape to a T x L count matrix so we can refit with fitModel() below
n <- as.matrix(dcast(year ~ site, data = dataset, value.var = "count")[, -1])
# INLA cannot use the same variable in two f() terms, hence the copy
dataset$year2 <- dataset$year

Run the model

library(INLA)
system.time(
  model <- inla(
    count ~ 
      env +
      f(year, model = "ar1", replicate = site) + 
      f(year2, model = "iid"), 
    data = dataset,
    family = "poisson"
  )
)
   user  system elapsed 
   0.18    0.14    3.77
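
The fit can be inspected with the standard INLA accessors, for example:

summary(model)       # includes the 'Time used' table quoted above
model$summary.fixed  # posterior summaries of the intercept and env effect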

Compare the speed with the solution from Ricardo

system.time(test <- fitModel(obs=n, env=env, T=T, L=L))
   user  system elapsed 
  11.06    0.00   11.06 

Compare the speed with a frequentist glmm (without autocorrelation)

library(lme4)
system.time(
  m <- glmer(
    count ~ env + (1|site) + (1|year), 
    data = dataset, 
    family = poisson
  )
)
   user  system elapsed 
   0.44    0.00    0.44 

The speed of INLA without autocorrelation

system.time(
  model <- inla(
    count ~ 
      env +
      f(site, model = "iid") + 
      f(year, model = "iid"), 
    data = dataset,
    family = "poisson"
  )
)
   user  system elapsed 
   0.19    0.11    2.09
answered by Thierry