Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

conditional sampling without replacement

Tags:

I am attempting to write a simulation that involves randomly re-assigning items to categories with some restrictions.

Lets say I have a collection of pebbles 1 to N distributed across buckets A through J:

set.seed(100)
df1 <- data.frame(pebble = 1:100, 
                  bucket = sample(LETTERS[1:10], 100, T), 
                  stringsAsFactors = F)
head(df1)
#>   pebble bucket
#> 1      1      D
#> 2      2      C
#> 3      3      F
#> 4      4      A
#> 5      5      E
#> 6      6      E

I want to randomly re-assign pebbles to buckets. Without restrictions I could do it like so:

random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
colSums(table(random.permutation.df1))
#>  A  B  C  D  E  F  G  H  I  J 
#>  4  7 13 14 12 11 11 10  9  9
colSums(table(df1))
#>  A  B  C  D  E  F  G  H  I  J 
#>  4  7 13 14 12 11 11 10  9  9

Importantly this re-assigns pebbles while ensuring that each bucket retains the same number (because we are sampling without replacement).

However, I have a set of restrictions such that certain pebbles cannot be assigned to certain buckets. I encode the restrictions in df2:

df2 <- data.frame(pebble = sample(1:100, 10), 
                  bucket = sample(LETTERS[1:10], 10, T), 
                  stringsAsFactors = F)
df2
#>    pebble bucket
#> 1      33      I
#> 2      39      I
#> 3       5      A
#> 4      36      C
#> 5      55      J
#> 6      66      A
#> 7      92      J
#> 8      95      H
#> 9       2      C
#> 10     49      I

The logic here is that pebbles 33 and 39 cannot be placed in bucket I, or pebble 5 in bucket A, etc. I would like to permute which pebbles are in which bucket subject to these restrictions.

So far, I've thought of tackling it in a loop as below, but this does not result in buckets retaining the same number of pebbles:

perms <- character(0)
cnt <- 1
for (p in df1$pebble) {
  perms[cnt] <- sample(df1$bucket[!df1$bucket %in% df2$bucket[df2$pebble==p]], 1)
  cnt <- cnt + 1
}
table(perms)
#> perms
#>  A  B  C  D  E  F  G  H  I  J 
#>  6  7 12 22 15  1 14  7  7  9

I then tried sampling positions, and then removing that position from the available buckets and the available remaining positions. This is also not working, and I suspect it is because I am sampling my way into branches of the tree that do not yield solutions.

set.seed(42)
perms <- character(0)
cnt <- 1
ids <- 1:nrow(df1)
bckts <- df1$bucket
for (p in df1$pebble) {
  id <- sample(ids[!bckts %in% df2$bucket[df2$pebble==p]], 1)
  perms[cnt] <- bckts[id]
  bckts <- bckts[-id]
  ids <- ids[ids!=id]
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G J 
#> 1 1 4 1 2 1 2 2 

Any thoughts or advice much appreciated (and apologies for the length).

EDIT:

I foolishly forgot to clarify that I was previously solving this by just resampling until I got a draw that didn't violate any of the conditions in df2, but I now have many conditions such that this would make my code take too long to run. I am still up for trying to force it if I could figure out a way to make forcing it faster.

like image 618
gfgm Avatar asked May 03 '18 16:05

gfgm


People also ask

What is an example of sampling without replacement?

In sampling without replacement, each sample unit of the population has only one chance to be selected in the sample. For example, if one draws a simple random sample such that no unit occurs more than one time in the sample, the sample is drawn without replacement.

What happens when you sample without replacement?

When we sample with replacement, the two sample values are independent. Practically, this means that what we get on the first one doesn't affect what we get on the second. Mathematically, this means that the covariance between the two is zero. In sampling without replacement, the two sample values aren't independent.

What is the formula of sampling without replacement?

For random samples of size n selected without replacement from an underlying population, the variance of the mean of all possible samples is equal to the modified variance of the underlying population divided by the sample size, multiplied by the finite population correction (FPC) factor.


1 Answers

I have a solution (I managed to write it in base R, but the data.table solution is easier to understand and write:

random.permutation.df2 <- data.frame(pebble = df1$pebble, bucket = rep(NA,length(df1$pebble)))
for(bucket in unique(df1$bucket)){
  N <-  length( random.permutation.df2$bucket[is.na(random.permutation.df2$bucket) & 
                                         !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket] ] )
  random.permutation.df2$bucket[is.na(random.permutation.df2$bucket) & 
                                  !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket] ] <- 
    sample(c(rep(bucket,sum(df1$bucket == bucket)),rep(NA,N-sum(df1$bucket == bucket))))

}

The idea is to sample the authorised peeble for each bucket: those that are not in df2, and those that are not already filled. You sample then a vector of the good length, choosing between NAs (for the following buckets values) and the value in the loop, and voilà.

Now easier to read with data.table

library(data.table)
random.permutation.df2 <- setDT(random.permutation.df2)
df2 <- setDT(df2)

for( bucketi in unique(df1$bucket)){
 random.permutation.df2[is.na(bucket) & !pebble %in% df2[bucket == bucketi, pebble], 
                        bucket := sample(c(rep(bucketi,sum(df1$bucket == bucket)),rep(NA,.N-sum(df1$bucket == bucket))))] 
}

it has the two conditions

> colSums(table(df1))
 A  B  C  D  E  F  G  H  I  J 
 4  7 13 14 12 11 11 10  9  9 
> colSums(table(random.permutation.df2))
 A  B  C  D  E  F  G  H  I  J 
 4  7 13 14 12 11 11 10  9  9 

To verify that there isn't any contradiction with df2

> df2
    pebble bucket
 1:     37      D
 2:     95      H
 3:     90      C
 4:     80      C
 5:     31      D
 6:     84      G
 7:     76      I
 8:     57      H
 9:      7      E
10:     39      A
> random.permutation.df2[pebble %in% df2$pebble,.(pebble,bucket)]
    pebble bucket
 1:      7      D
 2:     31      H
 3:     37      J
 4:     39      F
 5:     57      B
 6:     76      E
 7:     80      F
 8:     84      B
 9:     90      H
10:     95      D
like image 189
denis Avatar answered Oct 12 '22 12:10

denis