Let's say I want to create 50 binary variables of length 100 that are each correlated with each other.
For I create a correlation matrix with the specified rho's:
cor.mat <- matrix(0.2,nrow=50, ncol=50)
diag(cor.mat) <- 1
next I use rmvbin
:
library(bindata)
rmvbin<-rmvbin(100, margprob=x, bincorr=cor.mat)
However, I'm not sure how to calculate the margprob argument. Can someone help?
SHould it be a vector of the sum of probabilities in each row and column?
margprob
should simply be a repeated vector of the probability that any single binary variable is 1, independent of the rest; call this value p
. Assuming identically distributed variables (which given your correlation matrix seems to be the case), margprob=rep(p,50)
.
It should NOT be a vector of the sum of probabilities in each row and column, as the correlation matrix cannot be used to determine the marginal probabilities. If you're having trouble figuring out what the marginal probabilities for your random variables are, you will have to give more context for the problem and it would be a question more appropriate for math.stackexchange.com.
I think the problem has been that people saw the solution as being either too simple or not properly specified. You don't actually calculate the marginal probabilities ... you specify them. Then the rmvbin
function uses the specification of the marginal probabilities and the joint correlations to do the needed sampling to (on average) give joint distributions that match those specs.
library(bindata)
rmvbin<-rmvbin(100, margprob=rep(.5,50), bincorr=cor.mat)
> str(rmvbin)
num [1:100, 1:50] 0 0 0 1 0 0 0 1 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : NULL
So to look at the sampling characteristics of this result, you can see what correlation there is with the first column:
Hmisc::describe(apply(rmvbin[,-1], 2, function(col) cor(col, rmvbin[,1]) ) )
apply(rmvbin[, -1], 2, function(col) cor(col, rmvbin[, 1]))
n missing unique Mean .05 .10 .25 .50 .75 .90
49 0 38 0.2009 0.05886 0.09874 0.13309 0.19372 0.25208 0.29723
.95
0.33772
lowest : 0.03508 0.04013 0.08696 0.09874 0.10889
highest: 0.29942 0.32450 0.34653 0.40902 0.46714
So the average correlation under sampling was pretty close to the nominal value 0.2. but it did vary fairly widely.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With