
Generate data where cell counts are random, but row sums always the same

Tags:

r

I'm in a situation where I need to create a bunch of fake datasets where the sum of two variables is the same as in my real data, but the counts for each variable are random. Here's the setup:

>df
    X.1  X.2
1   145   30
2    55   73   

The first row sums to 175, and the second to 128. What I'm looking for is a way to generate a data frame (or a bunch of data frames) like this:

>df.2
    X.1  X.2
1   100   75
2    90   38

In df.2, the cell counts have changed, but the rows still sum to the same totals. The actual data has hundreds of rows but only two variables, if that helps. I've tried to figure out how to do this with sample() but haven't had any luck. Any suggestions?

Thanks!

bosbmgatl asked Aug 20 '12 00:08


1 Answer

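Perhaps you're looking for r2dtable? It generates random two-way tables with given row and column totals, so the column sums are held fixed as well as the row sums.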

> r2dtable(2, c(175,128), c(190, 113))
[[1]]
     [,1] [,2]
[1,]  108   67
[2,]   82   46

[[2]]
     [,1] [,2]
[1,]  114   61
[2,]   76   52
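To confirm that the margins are preserved and to get data frames rather than matrices, something like the following sketch may help. The column names X.1 and X.2 are taken from the question; the seed and the variable names (tabs, dfs, all_rows_ok, all_cols_ok) are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the draws are reproducible

row_totals <- c(175, 128)
col_totals <- c(190, 113)

# Three random tables with the given row and column totals
tabs <- r2dtable(3, row_totals, col_totals)

# Convert each matrix to a data frame using the question's column names
dfs <- lapply(tabs, function(m) data.frame(X.1 = m[, 1], X.2 = m[, 2]))

# Check that every table keeps both sets of margins
all_rows_ok <- all(sapply(tabs, function(m) all(rowSums(m) == row_totals)))
all_cols_ok <- all(sapply(tabs, function(m) all(colSums(m) == col_totals)))
```

Both checks should come back TRUE. Note that the column totals have to be supplied, which is a stronger constraint than the question asked for.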

Also, here's a version of @mnel's answer that uses rmultinom to do the n replicates and then combines the results. Not that it really matters if you only need a few replicates, but since rmultinom could do it, I thought I'd see how it might be done.

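n <- 10                                             # number of replicates
e <- cbind(X1 = c(100, 90, 30), X2 = c(75, 28, 120))
# For each row, draw n multinomial samples using that row's total and proportions,
# then reshape into a rows x columns x replicates array.
aperm(array(sapply(seq_len(nrow(e)), function(i)
        rmultinom(n, rowSums(e)[i], (e / rowSums(e))[i, ])),
      dim = c(ncol(e), n, nrow(e))), c(3, 1, 2))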
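Since the question has only two variables, a multinomial with two categories reduces to a binomial, so a simpler sketch with rbinom might be enough when only the row sums need to be fixed. This is a swapped-in variant, not the code above. It assumes the observed share of X.1 in each row is a sensible success probability; make_fake is a hypothetical helper name and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the draws are reproducible

df <- data.frame(X.1 = c(145, 55), X.2 = c(30, 73))
totals <- unname(rowSums(df))   # row sums to preserve: 175 and 128
p1 <- df$X.1 / totals           # assumed probability: observed share of X.1 per row

# Hypothetical helper: draw X.1 for every row, then set X.2 to the remainder
make_fake <- function() {
  x1 <- rbinom(nrow(df), size = totals, prob = p1)
  data.frame(X.1 = x1, X.2 = totals - x1)
}

# A list of five fake data frames, each with the original row sums
fakes <- replicate(5, make_fake(), simplify = FALSE)
```

Because X.2 is computed as the remainder, every row of every fake data frame sums to the original total by construction. Using prob = 0.5 instead of p1 would make the split independent of the observed data.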
Aaron left Stack Overflow answered Oct 08 '22 19:10