
R: Vectorize loop to create pairwise matrix

I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.

Here is an example df:

  df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
                   Fruit = c("apple", "orange", "pear",
                             "orange", "pear",
                             "pear", "apple",
                             "pear", "apple", "orange",
                             "pear", "apple", "orange"),
                   Order = c(1, 2, 3,
                            1, 2,
                            1, 2, 
                            1, 2, 3,
                            1, 1, 1))

In each Shop, Fruit is picked by a customer in a given Order.

The following function creates a square pairwise matrix with one row and one column per fruit:

loop.function <- function(df){
  
  fruits <- unique(df$Fruit)
  nt <- length(fruits)
  mat <- array(dim=c(nt,nt))
  
  for(m in 1:nt){
    
    for(n in 1:nt){
      
      ## filter df for each pair of fruit
      xm <- df[df$Fruit == fruits[m],]
      xn <- df[df$Fruit == fruits[n],]
      
      ## index instances when a pair of fruit are picked in same shop
      mm <- match(xm$Shop, xn$Shop)
      
      ## filter xm and xn based on mm
      xm <- xm[! is.na(mm),]
      xn <- xn[mm[! is.na(mm)],]
      
      ## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
      mat[m,n] <- sum(xn$Order < xm$Order)
    }
  }
  
  row.names(mat) <- fruits
  colnames(mat) <- fruits
  
  return(mat)
}

Here mat[m,n] is the number of times fruits[m] is picked after fruits[n], and mat[n,m] is the number of times fruits[m] is picked before fruits[n]. Ties are not recorded: if a pair of fruit is picked at the same time (e.g. in Shop E), neither count is incremented.

See expected output:

>loop.function(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

You can see here that pear is chosen twice before apple (in Shop C and D), and apple is chosen once before pear (in Shop A).
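As a quick sanity check of these two counts, the pear/apple pair can be worked out directly. This is only a sketch for a single pair; the names apple, pear and both are introduced here for illustration and are not part of loop.function().

```r
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

apple <- df[df$Fruit == "apple", ]
pear  <- df[df$Fruit == "pear", ]

# One row per shop in which both fruits were picked
both <- merge(apple, pear, by = "Shop", suffixes = c(".apple", ".pear"))

# Strict comparisons, so the tie in Shop E contributes to neither count
pear_before_apple <- sum(both$Order.pear < both$Order.apple)   # Shops C and D
apple_before_pear <- sum(both$Order.apple < both$Order.pear)   # Shop A
```

pear_before_apple corresponds to mat["apple", "pear"] and apple_before_pear to mat["pear", "apple"] in the output above.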

I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.

(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)

Update

Benchmarking on the real data (times = 10000) for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():

Unit: milliseconds
                    expr        min        lq       mean    median         uq      max neval   cld
      loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000     e
 tidyverse.function(dat)  21.523400  22.93695  26.795815  23.67290  26.862700 295.7456 10000   c
     loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000    d
 datatable.function(dat)   8.517600   9.28085  10.644163   9.97835  10.766749 215.3245 10000  b
  loop.function.TMS(dat)   4.482001   5.08030   5.916408   5.38215   5.833699  77.1935 10000 a

Probably the most interesting result for me is the performance of tidyverse.function() on the real data. I will have to try adding Rcpp solutions at a later date - I'm having trouble making them work on the real data.

I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!

asked Jul 08 '20 12:07 by jayb




2 Answers

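A data.table solution:

library(data.table)
setDT(df)
setkey(df, Shop)

# Self-join on Shop: one row per ordered pair of fruits picked in the same shop
pairs <- df[df, on = .(Shop = Shop), allow.cartesian = TRUE]

# Count the rows where Fruit was picked after i.Fruit, then reshape to wide format
dcast(pairs[, .(cnt = sum(i.Order < Order & i.Fruit != Fruit)), by = .(Fruit, i.Fruit)],
      Fruit ~ i.Fruit, value.var = 'cnt')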

    Fruit apple orange pear
1:  apple     0      0    2
2: orange     2      0    1
3:   pear     1      2    0

The Shop index isn't necessary for this example, but will probably improve performance on a larger dataset.
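For readers without data.table, the same self-join idea can be sketched in base R. This is an illustrative equivalent under the assumption that each fruit appears at most once per shop (as in the example data), not the code that was benchmarked; the names pairs, after, tab and mat are introduced here.

```r
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Self-join on Shop: every ordered pair of rows from the same shop
pairs <- merge(df, df, by = "Shop", suffixes = c(".m", ".n"))

# Keep the pairs where fruit m was picked strictly after fruit n
after <- pairs[pairs$Order.m > pairs$Order.n, ]

# Cross-tabulate; fixed factor levels keep fruits with zero counts in the result
fruits <- sort(unique(df$Fruit))
tab <- table(factor(after$Fruit.m, levels = fruits),
             factor(after$Fruit.n, levels = fruits))
mat <- matrix(tab, nrow = length(fruits), dimnames = list(fruits, fruits))
mat
```

Like the data.table version, this materialises all within-shop pairs before counting, so memory use grows with the square of the number of rows per shop.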

As the question raised many comments on performance, I decided to check what Rcpp could bring:

library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {

std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
Rcpp::NumericVector Order = df["Order"];
Rcpp::StringVector Fruit = df["Fruit"];
StringVector FruitLevels = sort_unique(Fruit);
IntegerVector FruitInt = match(Fruit, FruitLevels);
int n  = FruitLevels.length();

std::string currentShop = "";
int order, fruit, i, f;

NumericMatrix result(n,n);
NumericVector fruitOrder(n);

for (i=0;i<Fruit.length();i++){
    if (currentShop != Shop[i]) {
       //Init counter for each shop
       currentShop = Shop[i];
       std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
    }
    order = Order[i];
    fruit = FruitInt[i];
    fruitOrder[fruit-1] = order;
    for (f=0;f<n;f++) {
       // count fruits already picked earlier in this shop (0 = not picked yet)
       if (order > fruitOrder[f] && fruitOrder[f] > 0) {
         result(fruit-1,f) = result(fruit-1,f)+1;
       }
    }
}
rownames(result) = FruitLevels;
colnames(result) = FruitLevels;
return(result);
}
')

rcppPair(df)

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

On the example dataset, this runs >500 times faster than the data.table solution, probably because it avoids the cartesian join. It is not meant to be robust to malformed input: it expects each shop's rows to be contiguous and sorted by ascending Order, and because it uses 0 internally to mean "not yet picked", Order values must be positive.
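If the input is not guaranteed to arrive in that layout, it can be sorted first. A minimal sketch, assuming the column names of the example df; prepare_for_rcpp() is a hypothetical helper introduced here, not part of the answer's code.

```r
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Hypothetical helper: put rows in the layout the Rcpp loop relies on
prepare_for_rcpp <- function(df) {
  # the Rcpp loop treats 0 as "not picked yet", so orders must be positive
  stopifnot(all(df$Order > 0))
  # group each shop's rows together, ascending Order within a shop
  df[order(df$Shop, df$Order), ]
}

shuffled <- df[rev(seq_len(nrow(df))), ]   # deliberately out of order
sorted   <- prepare_for_rcpp(shuffled)
```

The sorted data frame could then be passed on as rcppPair(sorted).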

Considering the few minutes spent finding the 3 lines of code for the data.table solution, compared to the much longer process of writing and debugging the Rcpp solution, I wouldn't recommend going for Rcpp here unless there's a real performance bottleneck.

It is worth remembering, however, that if performance is a must, Rcpp might be worth the effort.

answered Oct 08 '22 07:10 by Waldi


Here is an approach that makes simple modifications to the original loop and runs about 5x faster on the example data.

loop.function2 <- function(df){

    spl_df = split(df[, c(1L, 3L)], df[[2L]])
    
    mat <- array(0L,
                 dim=c(length(spl_df), length(spl_df)),
                 dimnames = list(names(spl_df), names(spl_df)))
    
    for (m in 1:(length(spl_df) - 1L)) {
        xm = spl_df[[m]]
        mShop = xm$Shop
        for (n in ((1+m):length(spl_df))) {
            xn = spl_df[[n]]
            mm = match(mShop, xn$Shop)
            inds = which(!is.na(mm))
            mOrder = xm[inds, "Order"]
            nOrder = xn[mm[inds], "Order"]

            mat[m, n] <- sum(nOrder < mOrder)
            mat[n, m] <- sum(mOrder < nOrder)
        }
    }
    mat
}

There are 3 main concepts:

  1. The original df[df$Fruit == fruits[m], ] lines were inefficient, as you would be making the same comparison length(fruits)^2 times. Instead, we can use split(), which means we only scan the Fruit column once.
  2. There was a lot of use of df$var, which extracts the vector during each iteration. Here, we place the assignment of xm outside of the inner loop and try to minimize what we need to subset / extract.
  3. I changed it to be closer to combn: each unordered pair is visited once, and we re-use the match() result by computing both sum(nOrder < mOrder) and the reversed sum(mOrder < nOrder).
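To make the third point concrete, the same idea can be written with combn() itself. This is a sketch of a variant, not the benchmarked loop.function2(); it uses column names rather than positions, and like the original it assumes each fruit appears at most once per shop.

```r
df <- data.frame(
  Shop  = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
  Fruit = c("apple","orange","pear","orange","pear","pear","apple",
            "pear","apple","orange","pear","apple","orange"),
  Order = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

# Scan the Fruit column once: one small data frame per fruit
spl <- split(df[, c("Shop", "Order")], df$Fruit)
fruits <- names(spl)
mat <- matrix(0L, length(fruits), length(fruits), dimnames = list(fruits, fruits))

# Visit each unordered pair of fruits once and fill both cells from one match()
for (p in combn(fruits, 2, simplify = FALSE)) {
  xm <- spl[[p[1]]]
  xn <- spl[[p[2]]]
  mm <- match(xm$Shop, xn$Shop)
  keep <- !is.na(mm)
  mOrder <- xm$Order[keep]
  nOrder <- xn$Order[mm[keep]]
  mat[p[1], p[2]] <- sum(nOrder < mOrder)   # p[1] picked after p[2]
  mat[p[2], p[1]] <- sum(mOrder < nOrder)   # p[1] picked before p[2]
}
mat
```

With k fruits this performs k * (k - 1) / 2 match() calls instead of the k^2 in the original loop.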

Performance:

bench::mark(loop.function(df), loop.function2(df))

## # A tibble: 2 x 13
##   expression              min median
##   <bch:expr>         <bch:tm> <bch:>
## 1 loop.function(df)    3.57ms 4.34ms
## 2 loop.function2(df)  677.2us 858.6us

My hunch is that for your larger dataset, @Waldi's data.table solution will be faster. But for smaller datasets, this should be pretty performant.

Finally, here's yet another Rcpp approach that seems to be slower than @Waldi's:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
    int x_size = x.size();
    IntegerMatrix ans(x_size, x_size);
    
    for (int m = 0; m < x_size - 1; m++) {
        DataFrame xm = x[m];
        CharacterVector mShop = xm[0];
        IntegerVector mOrder = xm[1];
        int nrows = mShop.size();
        for (int n = m + 1; n < x_size; n++) {
            DataFrame xn = x[n];
            CharacterVector nShop = xn[0];
            IntegerVector nOrder = xn[1];
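            int nrows_n = nShop.size();
            for (int i = 0; i < nrows; i++) {
                // scan the rows of xn (not xm) for the matching shop
                for (int j = 0; j < nrows_n; j++) {
                    if (mShop[i] == nShop[j]) {
                        if (mOrder[i] > nOrder[j])
                            ans(m, n)++;
                        else if (mOrder[i] < nOrder[j])  // ties are not recorded
                            ans(n, m)++;
                        break;
                    }
                }
            }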
        }
    }
    return(ans);
}

# R wrapper: split by fruit, then pass the list to the compiled function
loop_wrapper = function(df) {
  loop_function_cpp(split(df[, c(1L, 3L)], df[[2L]]))
}
loop_wrapper(df)
answered Oct 08 '22 05:10 by Cole