I am using the functions defined in Extreme value analysis and user defined probability functions in Stan to model my data with a generalized Pareto distribution. My problem is that my model block loops over observations and passes three real-valued arguments, whereas the GPD functions expect a (vector, real, real) signature. I'm not sure my model block is amenable to being vectorized, so I was thinking I would need to have the GPD functions take real-valued arguments instead (but maybe I'm wrong). I'd appreciate any help with rearranging the code to achieve this. Here is my Stan code:
functions {
  real gpareto_lpdf(vector y, real k, real sigma) {
    // generalised Pareto log pdf
    int N = rows(y);
    real inv_k = inv(k);
    if (k < 0 && max(y) / sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
    if (sigma <= 0)
      reject("sigma<=0; found sigma =", sigma);
    if (fabs(k) > 1e-15)
      return -(1 + inv_k) * sum(log1p(y * (k / sigma))) - N * log(sigma);
    else
      return -sum(y) / sigma - N * log(sigma); // limit k->0
  }
  real gpareto_lcdf(vector y, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k < 0 && max(y) / sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
    if (sigma <= 0)
      reject("sigma<=0; found sigma =", sigma);
    if (fabs(k) > 1e-15)
      return sum(log1m_exp(-inv_k * log1p(y * (k / sigma))));
    else
      return sum(log1m_exp(-y / sigma)); // limit k->0
  }
}
data {
  // the input data
  int<lower = 1> n;                      // number of observations
  real<lower = 0> value[n];              // value measurements
  int<lower = 0, upper = 1> censored[n]; // vector of 0s and 1s
  // parameters for the prior
  real<lower = 0> a;
  real<lower = 0> b;
}
parameters {
  real k;
  real sigma;
}
model {
  // prior
  k ~ gamma(a, b);
  sigma ~ gamma(a, b);
  // likelihood
  for (i in 1:n) {
    if (censored[i]) {
      target += gpareto_lcdf(value[i] | k, sigma);
    } else {
      target += gpareto_lpdf(value[i] | k, sigma);
    }
  }
}
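For reference, the k != 0 branch of gpareto_lpdf above can be checked against scipy's genpareto, which uses the same parameterisation (shape c = k, scale = sigma, location 0). This is just an illustrative sanity check in Python, not part of the Stan model:

```python
import numpy as np
from scipy.stats import genpareto

def gpareto_lpdf(y, k, sigma):
    # Mirrors the Stan function for k != 0:
    # -(1 + 1/k) * sum(log1p(y * k / sigma)) - N * log(sigma)
    y = np.asarray(y, dtype=float)
    inv_k = 1.0 / k
    return -(1 + inv_k) * np.sum(np.log1p(y * (k / sigma))) - y.size * np.log(sigma)

y = np.array([0.3, 1.2, 2.5])
k, sigma = 0.2, 1.5
print(np.isclose(gpareto_lpdf(y, k, sigma),
                 genpareto.logpdf(y, c=k, scale=sigma).sum()))  # True
```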
Here is how the log PDF can be adapted so that index arrays for subsetting y into censored and non-censored observations can be passed:
real cens_gpareto_lpdf(vector y, int[] cens, int[] no_cens, real k, real sigma) {
  // generalised Pareto log pdf with right-censoring
  // only the non-censored observations contribute a -log(sigma) term
  int N = size(no_cens);
  real inv_k = inv(k);
  if (k < 0 && max(y) / sigma > -inv_k)
    reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
  if (fabs(k) > 1e-15)
    return -(1 + inv_k) * sum(log1p(y[no_cens] * (k / sigma))) - N * log(sigma) +
           sum(log1m_exp(-inv_k * log1p(y[cens] * (k / sigma))));
  else
    return -sum(y[no_cens]) / sigma - N * log(sigma) +
           sum(log1m_exp(-y[cens] / sigma));
}
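The decomposition here is: uncensored points contribute the log density, censored points the log CDF. A hedged Python sketch of the same computation (k != 0 branch), checked against scipy's genpareto, which provides both logpdf and logcdf:

```python
import numpy as np
from scipy.stats import genpareto

def cens_gpareto_lpdf(y, cens, no_cens, k, sigma):
    # Mirrors the adapted Stan function for k != 0; note the -N*log(sigma)
    # normalisation counts only the non-censored observations.
    y = np.asarray(y, dtype=float)
    inv_k = 1.0 / k
    N = len(no_cens)
    return (-(1 + inv_k) * np.sum(np.log1p(y[no_cens] * (k / sigma)))
            - N * np.log(sigma)
            + np.sum(np.log1p(-np.exp(-inv_k * np.log1p(y[cens] * (k / sigma))))))

y = np.array([0.4, 1.1, 0.9, 2.0])
cens, no_cens = [1, 3], [0, 2]   # 0-based here; the Stan data would be 1-based
k, sigma = 0.3, 1.0
d = genpareto(c=k, scale=sigma)
ref = d.logpdf(y[no_cens]).sum() + d.logcdf(y[cens]).sum()
print(np.isclose(cens_gpareto_lpdf(y, cens, no_cens, k, sigma), ref))  # True
```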
Extend the data block: n_cens, n_not_cens, cens, and no_cens are values that need to be supplied.
int<lower = 1> n;           // total number of obs
int<lower = 1> n_cens;      // number of censored obs
int<lower = 1> n_not_cens;  // number of regular obs
int cens[n_cens];           // index set censored
int no_cens[n_not_cens];    // index set regular
vector<lower = 0>[n] value; // value measurements
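The index sets can be built from the 0/1 censoring indicator before passing the data to Stan. A small sketch (Python here for illustration, though the answer uses rstan; the dict keys match the data block above, and remember Stan indexes from 1):

```python
import numpy as np

censored = np.array([0, 1, 0, 0, 1])  # 0/1 indicator, as in the question

# Stan is 1-indexed, so shift Python's 0-based positions up by one.
cens = (np.flatnonzero(censored == 1) + 1).tolist()     # [2, 5]
no_cens = (np.flatnonzero(censored == 0) + 1).tolist()  # [1, 3, 4]

stan_data = dict(n=len(censored), n_cens=len(cens), n_not_cens=len(no_cens),
                 cens=cens, no_cens=no_cens)
print(stan_data)
```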
Constrain the parameters to be positive, as suggested by gfgm:
parameters {
real<lower=0> k;
real<lower=0> sigma;
}
Rewrite the model block:
model {
  // prior
  k ~ gamma(a, b);
  sigma ~ gamma(a, b);
  // likelihood
  value ~ cens_gpareto(cens, no_cens, k, sigma);
}
Disclaimer: I neither checked the formulas for sanity nor ran the model on test data; I just compiled it via rstan::stan_model(), which worked fine. gfgm's suggestion may be more convenient for post-processing / computing stuff in generated quantities etc. I'm not a Stan expert :-).
Edit:
Fixed a divergence issue found by gfgm through simulation: the likelihood was ill-defined (N = rows(y) instead of the number of non-censored observations). It runs fine now with gfgm's data (using set.seed(123) and rstan):
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
k 0.16 0.00 0.10 0.02 0.08 0.14 0.21 0.42 1687 1
sigma 0.90 0.00 0.12 0.67 0.82 0.90 0.99 1.16 1638 1
lp__ -106.15 0.03 1.08 -109.09 -106.56 -105.83 -105.38 -105.09 1343 1