conditional sampling without replacement

Tags:

I am attempting to write a simulation that involves randomly re-assigning items to categories with some restrictions.

Lets say I have a collection of pebbles 1 to N distributed across buckets A through J:

set.seed(100)
df1 <- data.frame(pebble = 1:100, 
                  bucket = sample(LETTERS[1:10], 100, T), 
                  stringsAsFactors = F)
head(df1)
#>   pebble bucket
#> 1      1      D
#> 2      2      C
#> 3      3      F
#> 4      4      A
#> 5      5      E
#> 6      6      E

I want to randomly re-assign pebbles to buckets. Without restrictions I could do it like so:

random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
colSums(table(random.permutation.df1))
#>  A  B  C  D  E  F  G  H  I  J 
#>  4  7 13 14 12 11 11 10  9  9
colSums(table(df1))
#>  A  B  C  D  E  F  G  H  I  J 
#>  4  7 13 14 12 11 11 10  9  9

Importantly this re-assigns pebbles while ensuring that each bucket retains the same number (because we are sampling without replacement).

However, I have a set of restrictions such that certain pebbles cannot be assigned to certain buckets. I encode the restrictions in df2:

df2 <- data.frame(pebble = sample(1:100, 10), 
                  bucket = sample(LETTERS[1:10], 10, T), 
                  stringsAsFactors = F)
df2
#>    pebble bucket
#> 1      33      I
#> 2      39      I
#> 3       5      A
#> 4      36      C
#> 5      55      J
#> 6      66      A
#> 7      92      J
#> 8      95      H
#> 9       2      C
#> 10     49      I

The logic here is that pebbles 33 and 39 cannot be placed in bucket I, or pebble 5 in bucket A, etc. I would like to permute which pebbles are in which bucket subject to these restrictions.

So far, I've thought of tackling it in a loop as below, but this does not result in buckets retaining the same number of pebbles:

perms <- character(0)
cnt <- 1
for (p in df1$pebble) {
  perms[cnt] <- sample(df1$bucket[!df1$bucket %in% df2$bucket[df2$pebble==p]], 1)
  cnt <- cnt + 1
}
table(perms)
#> perms
#>  A  B  C  D  E  F  G  H  I  J 
#>  6  7 12 22 15  1 14  7  7  9

I then tried sampling positions, and then removing that position from the available buckets and the available remaining positions. This is also not working, and I suspect it is because I am sampling my way into branches of the tree that do not yield solutions.

set.seed(42)
perms <- character(0)
cnt <- 1
ids <- 1:nrow(df1)
bckts <- df1$bucket
for (p in df1$pebble) {
  id <- sample(ids[!bckts %in% df2$bucket[df2$pebble==p]], 1)
  perms[cnt] <- bckts[id]
  bckts <- bckts[-id]
  ids <- ids[ids!=id]
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G J 
#> 1 1 4 1 2 1 2 2

Any thoughts or advice much appreciated (and apologies for the length).

EDIT:

I foolishly forgot to clarify that I was previously solving this by just resampling until I got a draw that didn't violate any of the conditions in df2, but I now have many conditions such that this would make my code take too long to run. I am still up for trying to force it if I could figure out a way to make forcing it faster.

618

asked May 03 '18 16:05

gfgm

1 Answers

I have a solution (I managed to write it in base R, but the data.table solution is easier to understand and write:

random.permutation.df2 <- data.frame(pebble = df1$pebble, bucket = rep(NA,length(df1$pebble)))
for(bucket in unique(df1$bucket)){
  N <-  length( random.permutation.df2$bucket[is.na(random.permutation.df2$bucket) & 
                                         !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket] ] )
  random.permutation.df2$bucket[is.na(random.permutation.df2$bucket) & 
                                  !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket] ] <- 
    sample(c(rep(bucket,sum(df1$bucket == bucket)),rep(NA,N-sum(df1$bucket == bucket))))

}

The idea is to sample the authorised peeble for each bucket: those that are not in df2, and those that are not already filled. You sample then a vector of the good length, choosing between NAs (for the following buckets values) and the value in the loop, and voilà.

Now easier to read with data.table

library(data.table)
random.permutation.df2 <- setDT(random.permutation.df2)
df2 <- setDT(df2)

for( bucketi in unique(df1$bucket)){
 random.permutation.df2[is.na(bucket) & !pebble %in% df2[bucket == bucketi, pebble], 
                        bucket := sample(c(rep(bucketi,sum(df1$bucket == bucket)),rep(NA,.N-sum(df1$bucket == bucket))))] 
}

it has the two conditions

> colSums(table(df1))
 A  B  C  D  E  F  G  H  I  J 
 4  7 13 14 12 11 11 10  9  9 
> colSums(table(random.permutation.df2))
 A  B  C  D  E  F  G  H  I  J 
 4  7 13 14 12 11 11 10  9  9

To verify that there isn't any contradiction with df2

> df2
    pebble bucket
 1:     37      D
 2:     95      H
 3:     90      C
 4:     80      C
 5:     31      D
 6:     84      G
 7:     76      I
 8:     57      H
 9:      7      E
10:     39      A
> random.permutation.df2[pebble %in% df2$pebble,.(pebble,bucket)]
    pebble bucket
 1:      7      D
 2:     31      H
 3:     37      J
 4:     39      F
 5:     57      B
 6:     76      E
 7:     80      F
 8:     84      B
 9:     90      H
10:     95      D

189

answered Oct 12 '22 12:10

denis

Related questions
                            
                                Run Maven archetype "quickstart" in Java 10?
                            
                                Can both unapply and unapplySeq be defined in the same extractor
                            
                                A solution idea for incremental updates using browser cache
                            
                                Customizing the material theme in ag-Grid doesn't use the accent-color for the checkbox
                            
                                Field error in object 'user' on field 'userProfiles': rejected value [3];
                            
                                Does JS's spread syntax appear in other languages?
                            
                                Does reading or writing a whole 32-bit word, even though we only have a reference to a part of it, result in undefined behaviour?
                            
                                Serilog Destructure.ByTransforming<Interface>() not working
                            
                                Heroku commands throwing this error on mac: !error getting commands pid 29989 SIGSEGV (signal 11)
                            
                                Detect page change with javascript?
                            
                                Returning unique_ptr<Object> as unique_ptr<const Object>
                            
                                Shapely unable to split line on point due to precision issues

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With