What is the proper way to format a categorical predictor for use in Stan? I cannot seem to pass a categorical predictor as an ordinary factor variable, so what is the quickest way to transform a categorical variable into something Stan can accept?
For example, say I had a continuous predictor and a categorical predictor:
your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))
Which looks like this:
income country
1 62085.59 England
2 60806.33 England
3 60527.27 England
4 67112.64 USA
5 57675.92 USA
6 58128.44 USA
7 60822.47 South Africa
8 55805.80 South Africa
9 63982.99 South Africa
10 64555.45 Belgium
How would I prepare this to be passed to rstan?
It is correct that Stan only accepts real or integer variables as input. In this case, you want to convert the categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like
dummy_variables <- model.matrix(~ country, data = your_dataset)
which will look like this:
(Intercept) countryEngland countrySouth Africa countryUSA
1 1 1 0 0
2 1 1 0 0
3 1 1 0 0
4 1 0 0 1
5 1 0 0 1
6 1 0 0 1
7 1 0 1 0
8 1 0 1 0
9 1 0 1 0
10 1 0 0 0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"
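The reference category in the output above is Belgium because R sorts the levels alphabetically and the default treatment contrasts drop the first one. As a minimal sketch, assuming you would rather use "USA" as the baseline (that choice is only an illustration, not something from the question), you can relevel the factor before building the dummy variables:

```r
# Same example data as in the question.
your_dataset <- data.frame(
  income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92,
             58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
  country = c("England", "England", "England", "USA", "USA", "USA",
              "South Africa", "South Africa", "South Africa", "Belgium")
)

# Levels are sorted alphabetically, so "Belgium" is the default baseline.
levels(factor(your_dataset$country))

# Assumption for illustration: make "USA" the reference category instead.
your_dataset$country <- relevel(factor(your_dataset$country), ref = "USA")

# The baseline level gets no column of its own; it is absorbed into the intercept.
dummy_variables <- model.matrix(~ country, data = your_dataset)
colnames(dummy_variables)
```

With treatment contrasts, each remaining column is an indicator for one country, and its coefficient is the difference from the baseline.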
However, that matrix might not have the right number of rows if other variables in the model have missing values that you are not modeling: a formula that mentions only country never drops the rows that are incomplete elsewhere. This approach can be taken a step further by passing the entire model formula, like
X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)
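The row-count caveat can be checked directly. The sketch below uses a small made-up data frame (`with_missing` and its values are hypothetical) and assumes R's default `na.action` option of `na.omit`. Building the model frame once and extracting both the outcome and the design matrix from it is one way to keep them aligned:

```r
# Hypothetical toy data with one missing income value.
with_missing <- data.frame(
  income = c(62085.59, NA, 60527.27, 67112.64),
  country = c("England", "England", "USA", "USA")
)

# A formula with only country never sees the missing income, so all rows stay.
X_country_only <- model.matrix(~ country, data = with_missing)

# The full formula drops the incomplete row (under the default na.omit).
X_full <- model.matrix(income ~ country, data = with_missing)

nrow(X_country_only)
nrow(X_full)

# Build the model frame once so the outcome and the predictors share the same rows.
mf <- model.frame(income ~ country, data = with_missing)
y <- model.response(mf)
X <- model.matrix(attr(mf, "terms"), data = mf)
stopifnot(length(y) == nrow(X))
```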
Now, you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as
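data {
  int<lower=1> N;       // number of observations
  int<lower=1> K;       // number of columns in the design matrix
  matrix[N, K] X;       // design matrix (including the intercept column)
  vector[N] y;          // outcome
}
parameters {
  vector[K] beta;       // regression coefficients
  real<lower=0> sigma;  // residual standard deviation
}
model {
  y ~ normal(X * beta, sigma); // likelihood
  // priors
}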
Utilizing a design matrix is recommended because it makes your .stan program reusable with different variations of the same model or even different datasets.
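To connect the two pieces, here is a minimal sketch of how the data list for the program above could be assembled in R, assuming income is the outcome and country the only predictor. The list element names must match the names declared in the Stan data block; the file name `model.stan` in the commented-out call is hypothetical.

```r
# Same example data as in the question.
your_dataset <- data.frame(
  income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92,
             58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
  country = c("England", "England", "England", "USA", "USA", "USA",
              "South Africa", "South Africa", "South Africa", "Belgium")
)

# Design matrix: intercept column plus one dummy per non-reference country.
X <- model.matrix(income ~ country, data = your_dataset)

# Names must match the data block: N, K, X, y.
stan_data <- list(
  N = nrow(X),
  K = ncol(X),
  X = X,
  y = your_dataset$income  # no missing values here, so rows line up with X
)
str(stan_data)

# Requires the rstan package and the Stan program saved to a file
# (the file name is an assumption for illustration):
# fit <- rstan::stan(file = "model.stan", data = stan_data)
```

Because X already contains the intercept column, the first element of beta plays the role of the intercept and no separate intercept parameter is needed in the Stan program.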