Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate marginal probabilities for generating correlated binary variables

Tags:

r

bindata

Let's say I want to create 50 binary variables of length 100 that are each correlated with each other.

For I create a correlation matrix with the specified rho's:

cor.mat <- matrix(0.2,nrow=50, ncol=50)
diag(cor.mat) <- 1

next I use rmvbin:

library(bindata)
rmvbin<-rmvbin(100, margprob=x, bincorr=cor.mat)

However, I'm not sure how to calculate the margprob argument. Can someone help?

SHould it be a vector of the sum of probabilities in each row and column?

like image 993
user1984076 Avatar asked Sep 18 '13 12:09

user1984076


2 Answers

margprob should simply be a repeated vector of the probability that any single binary variable is 1, independent of the rest; call this value p. Assuming identically distributed variables (which given your correlation matrix seems to be the case), margprob=rep(p,50).

It should NOT be a vector of the sum of probabilities in each row and column, as the correlation matrix cannot be used to determine the marginal probabilities. If you're having trouble figuring out what the marginal probabilities for your random variables are, you will have to give more context for the problem and it would be a question more appropriate for math.stackexchange.com.

like image 147
jpd527 Avatar answered Nov 18 '22 18:11

jpd527


I think the problem has been that people saw the solution as being either too simple or not properly specified. You don't actually calculate the marginal probabilities ... you specify them. Then the rmvbin function uses the specification of the marginal probabilities and the joint correlations to do the needed sampling to (on average) give joint distributions that match those specs.

library(bindata)
rmvbin<-rmvbin(100, margprob=rep(.5,50), bincorr=cor.mat)

> str(rmvbin)
 num [1:100, 1:50] 0 0 0 1 0 0 0 1 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL

So to look at the sampling characteristics of this result, you can see what correlation there is with the first column:

Hmisc::describe(apply(rmvbin[,-1], 2, function(col) cor(col, rmvbin[,1]) ) )
apply(rmvbin[, -1], 2, function(col) cor(col, rmvbin[, 1])) 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
     49       0      38  0.2009 0.05886 0.09874 0.13309 0.19372 0.25208 0.29723 
    .95 
0.33772 

lowest : 0.03508 0.04013 0.08696 0.09874 0.10889
highest: 0.29942 0.32450 0.34653 0.40902 0.46714 

So the average correlation under sampling was pretty close to the nominal value 0.2. but it did vary fairly widely.

like image 29
IRTFM Avatar answered Nov 18 '22 19:11

IRTFM