My task is to assess how various environmental variables affect annual population fluctuations. For this, I need to fit a Poisson autoregressive model for time-series counts:

N_{i,j} \sim Poisson(\mu_{i,j})
\log(\mu_{i,j}) = \log(\mu_{i,j-1}) + \alpha + \beta x_{i,j} + \gamma_j

where N_{i,j} is the count of observed individuals at site i in year j, x_{i,j} is the environmental variable at site i in year j - these are the input data, and the rest are parameters: \mu_{i,j} is the expected number of individuals at site i in year j, and \gamma_j is the random effect for each year, assumed normally distributed, \gamma_j \sim N(0, \sigma^2).
Is it possible to fit such a model in R? I want to avoid fitting it in a Bayesian framework since the computation takes way too long (I have to fit around 5000 such models). I tried to transform the model for GLM, but once I had to add the random effect (gamma) it was no longer possible.
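(To be specific about the GLM attempt: taking logs of the recursion gives \log(\mu_{i,j}) - \log(\mu_{i,j-1}) = \alpha + \beta x_{i,j} + \gamma_j, so approximating \log(\mu_{i,j-1}) by the observed \log(N_{i,j-1}) turns the model into a Poisson regression with an offset, which glm() can fit; but plain glm() has no way to treat \gamma_j as a random effect.)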
First, let's create some simulated data (all the ad hoc functions are given at the end of the answer):
set.seed(12345) # updated to T=20 and L=40 for comparative purposes.
T = 20 # number of years
L = 40 # number of sites
N0 = 100 # average initial pop (to simulate data)
sd_env = 0.8 # to simulate the env (assumed mean 0)
env = matrix(rnorm(T*L, mean=0, sd=sd_env), nrow=T, ncol=L)
# 'real' parameters
alpha = 0.1
beta = 0.05
sd = 0.4
gamma = rnorm(T-1, mean=0, sd=sd)
mu_ini = log(rpois(n=L, lambda=N0)) # initial means (log scale)
par_real = list(alpha=alpha, beta=beta, gamma=gamma,
sd=sd, mu_ini=mu_ini)
mu = dynamics(par=par_real, x=env, T=T, L=L)
# observed abundances
n = matrix(rpois(length(mu), lambda=mu), nrow=T, ncol=L)
Now, for a given set of parameters, we can simulate the expected number of individuals at each site and year. And since we have the observed number of individuals, we can write the likelihood function for the observations (being Poisson distributed) and add a penalty for the annual deviates in the growth rate (to make them normally distributed). For this, the function dynamics will simulate the model and the function .getLogLike will calculate the objective function. Then we need to optimize the objective function. The parameters to estimate are alpha, beta, the annual deviates (gamma) and the initial expected number of individuals (mu_ini), and maybe sigma.
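Concretely, the objective function computed by .getLogLike is the penalized negative log-likelihood (dropping the constant \log(N_{i,j}!) term of the Poisson density):

-\ell = -\sum_{i,j}\left(N_{i,j}\log\mu_{i,j} - \mu_{i,j}\right) + \sum_{j}\frac{\gamma_j^2}{2\sigma^2}

The second term is the normal penalty on the annual deviates; when sigma is not estimated directly, .getLogLike substitutes sd(gamma) + 0.01 for it.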
For the first try, we can provide 0 as the initial guess for all parameters, except for the initial expected numbers, for which we can use the initial observed abundances (that's the MLE anyway).
fit0 = fitModel0(obs=n, env=env, T=T, L=L)
Optimal parameters:
alpha beta gamma1 gamma2 gamma3
0.28018842 0.05464360 -0.12904373 -0.15795001 -0.04502903
gamma4 gamma5 gamma6 gamma7 gamma8
0.05045117 0.08435066 0.28864816 0.24111786 -0.80569709
gamma9 gamma10 gamma11 gamma12 gamma13
0.22786951 0.10326086 -0.50096088 -0.08880594 -0.33392310
gamma14 gamma15 gamma16 gamma17 gamma18
0.22664634 -0.47028311 0.11782381 -0.16328820 0.04208037
gamma19 mu_ini1 mu_ini2 mu_ini3 mu_ini4
0.17648808 4.14267523 4.19187205 4.05573114 3.90406443
mu_ini5 mu_ini6 mu_ini7 mu_ini8 mu_ini9
4.08975038 4.17185883 4.03679049 4.23091760 4.04940333
mu_ini10 mu_ini11 mu_ini12 mu_ini13 mu_ini14
4.19355333 4.05543081 4.15598515 4.18266682 4.09095730
mu_ini15 mu_ini16 mu_ini17 mu_ini18 mu_ini19
4.17922360 3.87211968 4.04509178 4.19385641 3.98403521
mu_ini20 mu_ini21 mu_ini22 mu_ini23 mu_ini24
4.08531659 4.19294203 4.29891769 4.21025211 4.16297457
mu_ini25 mu_ini26 mu_ini27 mu_ini28 mu_ini29
4.19265543 4.28925869 4.10752810 4.10957212 4.14953247
mu_ini30 mu_ini31 mu_ini32 mu_ini33 mu_ini34
4.09690570 4.34234547 4.18169575 4.01663339 4.32713905
mu_ini35 mu_ini36 mu_ini37 mu_ini38 mu_ini39
4.08121891 3.98256819 4.08658375 4.05942834 4.06988174
mu_ini40
4.05655031
This works, but normally some parameters can be correlated and more difficult to identify from the data, so a sequential approach can be used (see Bolker et al. 2013 for more info). In this approach, we progressively increase the number of parameters, improving the initial guess for the optimization at each phase of the calibration. For this example, the first phase estimates only alpha and beta, using as guesses the coefficients of a linear model of the growth rate on the environment. Then, in the second phase, we use the estimates from the first optimization and add the annual deviates (gamma) as parameters. Finally, we use the estimates of the second optimization and add the initial expected values as parameters. We add the initial expected values last, assuming the initial observed values are already very close and a good guess to start, whereas we have no clear idea of the values of the remaining parameters.
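The phase-1 guesses follow directly from the model: if the observed counts track the expected ones, then \log(N_{i,j+1}/N_{i,j}) \approx \alpha + \beta x_{i,j} + \gamma_j, so an ordinary linear regression of the observed log growth rates on the environment (the lm() call inside fitModel) provides rough starting values for alpha and beta.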
fit = fitModel(obs=n, env=env, T=T, L=L)
Phase 1: alpha and beta only
Optimal parameters:
alpha beta
0.18172961 0.06323379
neg-LogLikelihood: -5023687
Phase 2: alpha, beta and gamma
Optimal parameters:
alpha beta gamma1 gamma2 gamma3
0.20519928 0.06238850 -0.35908716 -0.21453015 -0.05661066
gamma4 gamma5 gamma6 gamma7 gamma8
0.18963170 0.17800563 0.34303170 0.28960181 -0.72374927
gamma9 gamma10 gamma11 gamma12 gamma13
0.28464203 0.16900331 -0.40719047 -0.01292168 -0.25535610
gamma14 gamma15 gamma16 gamma17 gamma18
0.28806711 -0.38924648 0.19224527 -0.07875934 0.10880154
gamma19
0.24518786
neg-LogLikelihood: -5041345
Phase 3: alpha, beta, gamma and mu_ini
Optimal parameters:
alpha beta gamma1 gamma2
0.1962334008 0.0545361273 -0.4298024242 -0.1984379386
gamma3 gamma4 gamma5 gamma6
0.0240318556 0.1909639571 0.1116636126 0.3465693397
gamma7 gamma8 gamma9 gamma10
0.3478695629 -0.7500599493 0.3600551021 0.1361405398
gamma11 gamma12 gamma13 gamma14
-0.3874453347 -0.0005839263 -0.2305008546 0.2819608670
gamma15 gamma16 gamma17 gamma18
-0.3615273177 0.1792020332 -0.0685681922 0.1203006457
gamma19 mu_ini1 mu_ini2 mu_ini3
0.2506129351 4.6639314468 4.7205977429 4.5802529032
mu_ini4 mu_ini5 mu_ini6 mu_ini7
4.4293994068 4.6182382472 4.7039110982 4.5668031666
mu_ini8 mu_ini9 mu_ini10 mu_ini11
4.7610910879 4.5844180026 4.7226353021 4.5823048717
mu_ini12 mu_ini13 mu_ini14 mu_ini15
4.6814189824 4.7130039559 4.6135420745 4.7100006841
mu_ini16 mu_ini17 mu_ini18 mu_ini19
4.4080115751 4.5758092977 4.7209394881 4.5150790425
mu_ini20 mu_ini21 mu_ini22 mu_ini23
4.6171948847 4.7141188899 4.8303375556 4.7392110431
mu_ini24 mu_ini25 mu_ini26 mu_ini27
4.6893526309 4.7237687961 4.8234804183 4.6333012324
mu_ini28 mu_ini29 mu_ini30 mu_ini31
4.6392335265 4.6817044754 4.6260620666 4.8713345071
mu_ini32 mu_ini33 mu_ini34 mu_ini35
4.7107116580 4.5471434622 4.8540773708 4.6129553933
mu_ini36 mu_ini37 mu_ini38 mu_ini39
4.5134108799 4.6231016082 4.5823454113 4.5969785420
mu_ini40
4.5835763300
neg-LogLikelihood: -5047251
Comparing both calibrations of the model, we can see the second one provides a lower value for the objective function. Also, comparing the correlation between the 'real' annual deviates and the estimated ones, we have a higher correlation for the second calibration:
cor(gamma, fit0$par$gamma)
[1] 0.8708379
cor(gamma, fit$par$gamma)
[1] 0.9871758
And looking at the outputs, we can see there are some problems with the estimation of the initial expected values (underestimated for all sites) in the first calibration (with real data, a multi-phase calibration normally works way better):
par(mfrow=c(3,2), mar=c(3,5,1,1), oma=c(1,1,1,1))
for(i in 1:4) {
  ylim = c(0, 1.1*log(max(fit$fitted, n)))
  plot(log(fit$fitted[,i]), type="l", col="blue", ylim=ylim,
       ylab="mu (log)")
  lines(log(fit0$fitted[,i]), col="green")
  points(log(mu[,i]), col="red") # 'real' expected values
  mtext(paste("Site ", i), 3, adj=0.05, line=-2)
  if(i==3) {
    mtext(c("real", "fitModel0", "fitModel"), 1, adj=0.95,
          line=-1.5:-3.5, col=c("red", "green", "blue"), cex=0.8)
  }
}
mus = rbind(mu_ini, fit$par$mu_ini, fit0$par$mu_ini)
barplot(mus, beside=TRUE, col=c("red", "blue", "green"),
        ylab="Initial expected population (log)",
        xlab="Sites", border=NA)
gam = rbind(gamma, fit$par$gamma, fit0$par$gamma)
barplot(gam, beside=TRUE, col=c("red", "blue", "green"),
        ylab="Annual deviates", border=NA)
Finally,
system.time(fitModel(obs=n, env=env, T=T, L=L))
user system elapsed
9.85 0.00 9.85
This is around four times slower than the solution proposed by @Thierry using INLA (timings from summary(model)):
Time used:
Pre-processing Running inla Post-processing Total
0.1070 2.3131 0.0460 2.4661
However, after byte-compiling the functions with compiler::cmpfun (as done in the appendix below), we get:
user system elapsed
7.53 0.00 7.53
That's 24% faster, and now only 3 times slower than the INLA method. Still, I think it is reasonable even for thousands of experiments (my own model takes 5 days just for one optimization, so maybe I have a bias here). And since we're using simulated data, we can compare the reliability of the parameter estimates in addition to the computation time.
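As a side note, the 5000 fits are independent, so they are trivially parallelizable. A minimal sketch, assuming the observation and environment matrices are stored in two hypothetical lists obs_list and env_list:

library(parallel)
# fork-based parallelism (Unix-alikes); on Windows use parLapply() with a cluster
fits = mclapply(seq_along(obs_list), function(k) {
  fitModel(obs=obs_list[[k]], env=env_list[[k]], T=T, L=L)
}, mc.cores=4)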
# The functions -----------------------------------------------------------
require(compiler)
# Simulate the expected abundances mu[t, l] forward in time for a given
# parameter set. 'obs' is only needed to initialize mu when par$mu_ini is
# missing, so it defaults to NULL.
dynamics = function(par, obs=NULL, x, T, L) {
  alpha  = par$alpha
  beta   = par$beta
  gamma  = if(!is.null(par$gamma)) par$gamma else rep(0, T-1)
  mu_ini = if(!is.null(par$mu_ini)) exp(par$mu_ini) else obs[1,]
  mu = matrix(nrow=T, ncol=L)
  mu[1,] = mu_ini
  for(t in seq_len(T-1)) {
    # one step of the autoregressive dynamics on the log scale
    log_mu_new = log(mu[t,]) + alpha + beta*x[t,] + gamma[t]
    mu[t+1, ] = exp(log_mu_new)
  }
  return(mu)
}
dynamics = cmpfun(dynamics)
# Convert the named parameter vector used by optim() back into a list.
reListPars = function(par) {
  out = list()
  out$alpha = as.numeric(par["alpha"])
  out$beta = as.numeric(par["beta"])
  if(!is.na(par["sd"])) out$sd = as.numeric(par["sd"])
  gammas = as.numeric(par[grepl("gamma", names(par))])
  if(length(gammas)>0) out$gamma = gammas
  mu_inis = as.numeric(par[grepl("mu_ini", names(par))])
  if(length(mu_inis)>0) out$mu_ini = mu_inis
  return(out)
}
reListPars = cmpfun(reListPars)
# Penalized negative log-likelihood: Poisson log-likelihood of the
# observations (up to a constant) plus a normal penalty on the annual deviates.
.getLogLike = function(par, obs, env, T, L) {
  par = reListPars(par)
  if(is.null(par$sd)) {
    par$sd = if(!is.null(par$gamma)) sd(par$gamma)+0.01 else 1
  }
  mu = dynamics(par=par, obs=obs, x=env, T=T, L=L)
  logLike = sum(obs*log(mu) - mu) - sum(par$gamma^2/(2*par$sd^2))
  return(-logLike)
}
.getLogLike = cmpfun(.getLogLike)
# Upper bounds for the L-BFGS-B search, scaled from the initial guesses.
.getUpper = function(par) {
  par$alpha = 10*par$alpha + 1
  par$beta = 10*abs(par$beta) + 1
  if(!is.null(par$gamma)) {
    sd = if(!is.null(par$sd)) par$sd else sd(par$gamma)
    if(sd==0) sd = 1
    par$gamma = rep(qnorm(0.999, sd=sd), length(par$gamma))
  }
  if(!is.null(par$mu_ini)) par$mu_ini = 5*par$mu_ini
  if(!is.null(par$sd)) par$sd = 10*par$sd
  par = unlist(par)
  return(par)
}
.getUpper = cmpfun(.getUpper)
# Lower bounds for the L-BFGS-B search, scaled from the initial guesses.
.getLower = function(par) {
  par$alpha = 0 # alpha>0?
  par$beta = -10*abs(par$beta) - 1
  if(!is.null(par$gamma)) {
    sd = if(!is.null(par$sd)) par$sd else sd(par$gamma)
    if(sd==0) sd = 1
    par$gamma = rep(qnorm(1-0.999, sd=sd), length(par$gamma))
  }
  if(!is.null(par$mu_ini)) par$mu_ini = 0.2*par$mu_ini
  if(!is.null(par$sd)) par$sd = 0.0001*par$sd
  par = unlist(par)
  return(par)
}
.getLower = cmpfun(.getLower)
# Multi-phase calibration: estimate parameters sequentially, using each
# phase's estimates as initial guesses for the next.
fitModel = function(obs, env, T, L) {
  # initial guesses for alpha and beta from a linear model of the
  # observed log growth rates on the environment
  r = log(obs[-1,]/obs[-T,])
  guess = data.frame(rate=as.numeric(r), env=as.numeric(env[-T,]))
  mod1 = lm(rate ~ env, data=guess)
  output = list()
  output$par = NULL
  # Phase 1: alpha and beta only
  cat("Phase 1: alpha and beta only\n")
  par = list()
  par$alpha = as.numeric(coef(mod1)[1])
  par$beta = as.numeric(coef(mod1)[2])
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase1 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  # Phase 2: alpha, beta and gamma
  cat("Phase 2: alpha, beta and gamma\n")
  optpar = reListPars(opt$par)
  par$alpha = optpar$alpha
  par$beta = optpar$beta
  par$gamma = rep(0, T-1)
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase2 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  # Phase 3: alpha, beta, gamma and mu_ini
  cat("Phase 3: alpha, beta, gamma and mu_ini\n")
  optpar = reListPars(opt$par)
  par$alpha = optpar$alpha
  par$beta = optpar$beta
  par$gamma = optpar$gamma
  par$mu_ini = log(obs[1,])
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par),
              control=list(maxit=1000))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase3 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  output$par = reListPars(opt$par)
  output$fitted = dynamics(par=output$par, obs=obs, x=env, T=T, L=L)
  output$observed = obs
  output$env = env
  return(output)
}
fitModel = cmpfun(fitModel)
# Single-phase calibration: estimate all parameters at once from naive guesses.
fitModel0 = function(obs, env, T, L) {
  output = list()
  output$par = NULL
  par = list()
  par$alpha = 0
  par$beta = 0
  par$gamma = rep(0, T-1)
  par$mu_ini = log(obs[1,])
  opt = optim(par=unlist(par), fn=.getLogLike, gr=NULL,
              obs=obs, env=env, T=T, L=L, method="L-BFGS-B",
              upper=.getUpper(par), lower=.getLower(par))
  opt$bound = data.frame(par=unlist(par), low=.getLower(par),
                         upp=.getUpper(par))
  output$phase1 = opt
  cat("Optimal parameters: \n")
  print(opt$par)
  cat("\nneg-LogLikelihood: ", opt$value, "\n")
  output$par = reListPars(opt$par)
  output$fitted = dynamics(par=output$par, obs=obs, x=env, T=T, L=L)
  output$observed = obs
  output$env = env
  return(output)
}
fitModel0 = cmpfun(fitModel0)
Have a look at the INLA package http://www.r-inla.org
It is Bayesian, but uses integrated nested Laplace approximation, which makes the speed of a model comparable to that of frequentist models (glm, glmm).
Starting from the mu and env objects from Ricardo Oliveros-Ramos's answer, with L = 40. First, prepare the dataset:
dataset <- data.frame(
count = rpois(length(mu), lambda = mu),
year = rep(seq_len(T), L),
site = rep(seq_len(L), each = T),
env = as.vector(env)
)
library(reshape2)
n <- as.matrix(dcast(year ~ site, data = dataset, value.var = "count")[, -1])
dataset$year2 <- dataset$year # duplicate of year: INLA needs a distinct variable for each f() term
Run the model
library(INLA)
system.time(
model <- inla(
count ~
env +
f(year, model = "ar1", replicate = site) +
f(year2, model = "iid"),
data = dataset,
family = "poisson"
)
)
user system elapsed
0.18 0.14 3.77
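If the parameter estimates are needed as well (relevant for comparing reliability, not just speed), they can be read from the standard INLA result object, e.g.:

model$summary.fixed     # posterior summaries of the fixed effects (the env coefficient plays the role of beta)
model$summary.hyperpar  # precisions of the ar1 and iid random terms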
Compare the speed with the solution from Ricardo
system.time(test <- fitModel(obs=n, env=env, T=T, L=L))
user system elapsed
11.06 0.00 11.06
Compare the speed with a frequentist glmm (without autocorrelation)
library(lme4)
system.time(
m <- glmer(
count ~ env + (1|site) + (1|year),
data = dataset,
family = poisson
)
)
user system elapsed
0.44 0.00 0.44
The speed of INLA without autocorrelation:
system.time(
model <- inla(
count ~
env +
f(site, model = "iid") +
f(year, model = "iid"),
data = dataset,
family = "poisson"
)
)
user system elapsed
0.19 0.11 2.09