
Generating Random Pairs of Integers without Replacement in R

I want to draw random integer pairs without replacement (said another way, I don't want any duplicate pairs). The concept sounds simple, but I can't think of a quick and simple solution.

Imagine, for example, that I want to generate random pairs of integers using the sequence of integers 1:4 to populate the elements of each pair. Also assume I want to generate 5 random pairs without replacement. I then want to be able to generate something like this...

     [,1] [,2]
[1,]    1    2
[2,]    2    1
[3,]    3    3
[4,]    1    4
[5,]    4    3

In the above example, there are no duplicate pairs (i.e. rows). There are, however, duplicate integers within each column of the matrix. Consequently, using sample() to generate random numbers for each column separately will not work.
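For example, this quick sketch (arbitrary seed) shows how duplicate rows can slip in when the columns are drawn independently:

set.seed(42)
# each column is fine on its own, but a whole row (pair) can repeat
m <- cbind(sample(1:4, 5, replace = TRUE), sample(1:4, 5, replace = TRUE))
anyDuplicated(m)  # nonzero whenever some pair repeats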

Another potential solution, which will not work in my context, is to generate numerous pairs that include duplicates and then delete the duplicates retroactively. I can't do this because I need to end up with a specific number of pairs.

I'm looking for an efficient solution to this problem. It seems like such a simple question that it must have a simple solution (i.e. please, no nested for loops).

Here is my ugly approach:

# This matrix maps a unique id (1:16) to a pair (the row & column of the matrix)
r.mat <- matrix(1:(4 * 4), 4, 4)
# Draw random ids
r.id <- sample(r.mat, 5, replace = FALSE)
# Map each random id to its pair
r.pair <- t(sapply(r.id, function(x) which(r.mat == x, arr.ind = TRUE)))

This works fine for my toy example, but when I want to draw a large number of pairs from the sequence 1:10000000, it does not scale.
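For a sense of scale, the id matrix alone could never be allocated:

# an explicit id matrix over 1:1e7 would need 1e7 * 1e7 doubles
1e7 * 1e7 * 8 / 2^40  # ~728 TiB of memory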

asked Apr 17 '15 by Jacob H

4 Answers

The key here is not to generate all the permutations, as that is very expensive in both memory and time. Since you only care about two numbers, we can do this very easily so long as (number_of_possible_values) ^ 2 is less than the largest integer that can be represented exactly in double precision floating point (2^53):

size <- 1e5
samples <- 100
vals <- sample.int(size ^ 2, samples)
# shift by one before dividing so both components land in 1:size
# (without the "- 1", ids that are exact multiples of size decode wrongly)
cbind((vals - 1) %/% size + 1, (vals - 1) %% size + 1)

Basically, we use a single integer to represent every possible pair of values. In our example, we sample from all the numbers up to 1e5 ^ 2, since there are 1e5 ^ 2 possible pairs of 1e5 numbers. Each of those 1e10 integers represents one pair. We then decompose the integer into its two components by taking the integer division by size as the first number and the modulo as the second (shifting by one so both components land in 1:size).
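A tiny worked example of the decoding: with size = 4, the pair (2, 3) corresponds to id (2 - 1) * 4 + 3 = 7, and decoding recovers it:

size <- 4
id <- 7
c((id - 1) %/% size + 1, (id - 1) %% size + 1)  # 2 3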

Benchmarks:

Unit: microseconds
                   expr        min         lq       mean
  funBrodie(10000, 100)     16.457     17.188     22.052
 funRichard(10000, 100) 542513.717 640647.919 638045.215

Also, the limit on size should be ~3e7, and the method remains fast there:

Unit: microseconds
                  expr    min      lq     mean median      uq    max neval
 funBrodie(1e+07, 100) 18.285 20.6625 22.88209 21.211 22.4905 77.893   100

Functions for benchmarking:

funRichard <- function(size, samples) {
  # requires the data.table package for CJ()
  nums <- 1:size
  dt <- CJ(nums, nums)
  dt[sample(nrow(dt), size = samples), ]
}
funBrodie <- function(size, samples) {
  vals <- sample.int(size ^ 2, samples)
  cbind((vals - 1) %/% size + 1, (vals - 1) %% size + 1)
}
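The timings above were presumably produced along these lines (the exact call isn't shown here, so this is an assumed reproduction using the microbenchmark package):

library(microbenchmark)
library(data.table)  # funRichard needs CJ()
microbenchmark(funBrodie(10000, 100), funRichard(10000, 100))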

And we confirm the two approaches produce the same thing (note it's not a given the results should be exactly equal, but it turns out they are):

set.seed(1)
resB <- funBrodie(1e4, 100)
set.seed(1)
resR <- unname(as.matrix(funRichard(1e4, 100)))
all.equal(resB, resR)
# TRUE
answered by BrodieG

First, I found how to generate the pairs on SO. However, that didn't scale, so I looked up ?combn and found the expand.grid function.

Next, I use the data.table package because it handles large data well (see its documentation for why).

## the data.table library does well with large data sets
library(data.table)

## Small dummy dataset
pairOne = 1:10
pairTwo = 1:2
nSamples = 3

system.time({
dt = data.table(expand.grid(pairOne, pairTwo))
dt2 = dt[sample(1:dim(dt)[1], size = nSamples), ]
})
#   user  system elapsed 
#  0.002   0.001   0.001 

## Large dummy dataset
pairOne = 1:10000
pairTwo = 1:10000
length(pairOne) * length(pairTwo)
nSamples = 1e5
system.time({
dt = data.table(expand.grid(pairOne, pairTwo))
dt2 = dt[sample(1:dim(dt)[1], size = nSamples), ]
})
#   user  system elapsed 
#  2.576   1.276   3.862 
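As a side note, data.table's CJ() builds the sorted cross join directly and is typically faster than wrapping expand.grid() in data.table(); an untimed sketch of the same sampling step:

library(data.table)
dt <- CJ(pairOne, pairTwo)  # sorted cross join as a data.table
dt2 <- dt[sample(nrow(dt), size = nSamples)]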
answered by Richard Erickson


Inspired by David Robinson's initial stab:

set.seed(1)
np <- 1000 # number of elements desired
M1 <- t(combn(1:np, 2))
sam <- sample(1:nrow(M1), np, replace = FALSE)
M2 <- M1[sam,]
anyDuplicated(M2) # returns 0, i.e. no duplicated rows

This samples np of the possible pairs in M1, in a random order. Is this what you wanted?
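One caveat: combn() only emits pairs whose first element is smaller than the second, so rows like (2, 1) or (3, 3) from the question's example can never occur. If both orderings should be possible, a small follow-up fix (sketched here, not part of the answer above) is to flip each sampled pair at random; note it still excludes (a, a) pairs:

swap <- sample(c(TRUE, FALSE), nrow(M2), replace = TRUE)
M2[swap, ] <- M2[swap, 2:1]
anyDuplicated(M2)  # still 0: distinct unordered pairs stay distinct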

answered by Bryan Hanson


Here's my attempt. It doesn't look very elegant, but it's still a bit faster than @Richard Erickson's (2.0 s vs 2.6 s, for the same sizes). The idea is to avoid creating the permutations, because that can take a lot of time and memory. Instead, I draw two random samples of ids in the given range and check whether any pair (in either order) happens to be duplicated, which is very unlikely for a large range and a moderate number of samples. If there are duplicates, a new sample for column 2 is drawn and the check repeated.

range <- 1e8
n <- 1e5
ids1 <- sample(range, n)
ids2 <- sample(range, n)
mat1 <- cbind(ids1, ids2)
found <- FALSE
while (!found) {
  # a pair and its reverse count as the same, so check both orders
  if (any(duplicated(rbind(mat1, mat1[, 2:1])))) {
    ids2 <- sample(range, n)
    mat1 <- cbind(ids1, ids2)
  } else {
    found <- TRUE
  }
}
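The same idea wraps naturally into a reusable function (a sketch; sample_pairs is a made-up name):

sample_pairs <- function(range, n) {
  ids1 <- sample(range, n)
  ids2 <- sample(range, n)
  # resample column 2 until no pair (or its reverse) is duplicated
  while (any(duplicated(rbind(cbind(ids1, ids2), cbind(ids2, ids1))))) {
    ids2 <- sample(range, n)
  }
  cbind(ids1, ids2)
}
m <- sample_pairs(1e8, 1e5)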
answered by Molx