Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorise R code to randomly select 2 columns from each row

Does anyone have suggestions of how I could vectorise this code or otherwise speed it up? I'm creating a matrix, potentially very large. In each row, I want to select 2 columns at random, and flip them from 0 to 1.

I cannot select the same row and column number, i.e. the diagonal of the matrix will be zero hence the (1:N)[-j] in sample(). Because this changes with each row, I can't see a way to do this by using vectorisation, but parallelisation could be an option here?

I use library(Matrix) for sparse matrix functionality.

library(Matrix)
N <- 100
m <- Matrix(0, nrow = N, ncol = N)

for(j in 1:N) {
    cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
    m[j, cols] <- 1
}

Any ideas?

like image 681
Sam Rogers Avatar asked Nov 21 '25 01:11

Sam Rogers


2 Answers

library(Matrix)
N <- 7

desired_output <- Matrix(0, nrow = N, ncol = N)
set.seed(1)
for(j in 1:N) {
  cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
  desired_output[j, cols] <- 1
}

# 7 x 7 sparse Matrix of class "dgCMatrix"
#                   
# [1,] . . 1 . . . 1
# [2,] . . . . 1 1 .
# [3,] . 1 . . . 1 .
# [4,] . . . . 1 . 1
# [5,] 1 . . 1 . . .
# [6,] 1 1 . . . . .
# [7,] . 1 . . 1 . .

res <- Matrix(0, nrow = N, ncol = N)
set.seed(1)
ind <- cbind(rep(1:N, each = 2), c(sapply(1:N, function(j) sample((1:N)[-j], 2))))
res[ind] <- 1

all.equal(res, desired_output)
# [1] TRUE

Quick bench:

microbenchmark::microbenchmark(
  OP = {
    desired_output <- Matrix(0, nrow = N, ncol = N)
    set.seed(1)
    for(j in 1:N) {
      cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
      desired_output[j, cols] <- 1
    }
  },
  Aurele = {
    res <- Matrix(0, nrow = N, ncol = N)
    set.seed(1)
    ind <- cbind(rep(1:N, each = 2), c(sapply(1:N, function(j) sample((1:N)[-j], 2))))
    res[ind] <- 1
  }
)

# Unit: milliseconds
#    expr       min        lq      mean    median        uq       max neval cld
#      OP 10.240969 10.509384 11.065336 10.804949 11.044846 14.903377   100   b
#  Aurele  1.185001  1.258037  1.392021  1.363503  1.434818  4.553614   100  a 
like image 130
Aurèle Avatar answered Nov 22 '25 16:11

Aurèle


EDIT: I've edited my answer to make it simpler and to include the way R and RcppArmadillo make the sampling. And now it seems to be linear with N (as I thought it would be).


There are two "problems" in your code:

  • sample((1:N)[-j], 2) is uncessary allocating memory, making your solution quadratic with N. The solution would be to use rejection sampling as N is large (so rejection will not occur often).
  • you replace indices that are not "contiguous" in the matrix.

It is true that because you have sample with no replacement, it is not straightforward to make a vectorized solution for your problem. But again, it would be possible by using rejection sampling. Here I prefer an Rcpp solution:

Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix sample2(int N) {

  IntegerMatrix res(2 * N, 2);
  int j, ind1, ind2;

  for (j = 0; j < N; j++) {

    res(2 * j, 0) = res(2 * j + 1, 0) = j + 1;

    // sample first one
    do {
      ind1 = N * unif_rand();
    } while (ind1 == j);
    res(2 * j, 1) = ind1 + 1;

    // sample second one
    do {
      ind2 = N * unif_rand();
    } while (ind2 == j || ind2 == ind1);
    res(2 * j + 1, 1) = ind2 + 1;
  }

  return res;
}

R:

# table(replicate(1e5, sample2(5)))  # Verify that the sampling is OK
library(Matrix)
N <- 1000
m <- Matrix(0, nrow = N, ncol = N)
m[sample2(N)] <- 1

Benchmark:

microbenchmark::microbenchmark(
  OP = {
    desired_output <- Matrix(0, nrow = N, ncol = N)
    for(j in 1:N) {
      cols <- sample((1:N)[-j], 2) #Choose 2 columns not equal to the 
      desired_output[j, cols] <- 1
    }
  },
  Aurele = {
    res <- Matrix(0, nrow = N, ncol = N)
    ind <- cbind(rep(1:N, each = 2), c(sapply(1:N, function(j) sample((1:N)[-j], 2))))
    res[ind] <- 1
  },
  privefl = {
    m <- Matrix(0, nrow = N, ncol = N)
    m[sample2(N)] <- 1
  },
  times = 20
)

Results with N = 1000:

Unit: milliseconds
    expr        min         lq       mean     median         uq        max neval
      OP 599.996226 605.868229 618.479868 615.653853 625.908794 679.292360    20
  Aurele  12.315795  12.633971  14.183891  13.148149  15.118948  19.649716    20
 privefl   1.401824   1.493371   1.649015   1.622826   1.704273   2.520239    20

Results with N = 10,000:

Unit: milliseconds
    expr        min         lq       mean     median        uq         max neval
  Aurele 812.018743 845.434915 903.387191 863.851661 967.08294 1084.738882    20
 privefl   3.657525   4.083799   4.409226   4.239576   4.49501    6.413498    20
like image 35
F. Privé Avatar answered Nov 22 '25 17:11

F. Privé



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!