Is there a faster method for making paired comparisons than expand.grid in Base R?

Colleagues,

I want to estimate the probability that a randomly selected item from one group will have a higher score than a randomly selected item from a different group. This is the Probability of Superiority, sometimes called the Common-Language Effect Size (see for example https://rpsychologist.com/d3/cohend/). It can be calculated algebraically if we assume the data are normally distributed (McGraw and Wong, 1992, Psychological Bulletin, 111), but I know that my data are not normally distributed, which makes such an estimate unreliable.
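
For reference, the normal-theory estimate reduces to a one-liner; this is just a minimal sketch for comparison (the name cles_normal is a placeholder, not code I actually use):

# Normal-theory Common-Language Effect Size (McGraw & Wong, 1992):
# if a and b are normal, P(a > b) = pnorm((mean(a) - mean(b)) / sqrt(var(a) + var(b)))
cles_normal <- function(a, b) {
  pnorm((mean(a) - mean(b)) / sqrt(var(a) + var(b)))
}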

My solution is simple brute force: compare each observation in one group with each observation in the other group, count the pairs where a > b, and count ties as 0.5.

My first attempt was a for-if-else loop and looked like this:

# Generate sample data
alpha <- as.integer(runif(2000, min = 0, max = 100))
beta <- as.integer(runif(2000, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)

#initialise vars
bigger  <- 0
equal   <- 0.0
smaller <- 0

# Loop
for (i in alpha) {
  for (j in beta) {
    if (i > j) {
      bigger <- bigger + 1
    } else if (i < j) {
      smaller <- smaller + 1
    } else {
      equal <- equal + 0.5
    }
  }
}

# record Probability a > b
PaGTb <- (bigger + equal) / nPairs

This does the job, but it's woefully slow, especially in a Shiny app.

A work colleague reminded me that R is vector-based and suggested the expand.grid function. So my second attempt looks like this:

# generating sample data
alpha <- as.integer(runif(2000, min = 0, max = 100))
beta <- as.integer(runif(2000, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)

# Each observation in alpha is paired with each observation in beta
# (the result is named "pairs" to avoid masking base::c)
pairs <- expand.grid(a = alpha, b = beta)

# Count frequency of a > b, with ties counted as 0.5
aGTb <- length(which(pairs$a > pairs$b))
aEQb <- length(which(pairs$a == pairs$b))
aGTb <- aGTb + (0.5 * aEQb)

# record Probability a > b
PaGTb <- aGTb / nPairs

This is noticeably faster!

But not dramatically faster. In simulations I found that the vector-based method is faster for many paired comparisons (millions), while the for-if-else method is faster for relatively few comparisons (thousands). The following table summarises results for 1000 replications of 3334 * 3000 = 10,002,000 paired comparisons on my iMac.

     time_ForIfElse    time_Vector     Ratio_Vector/ForIfElse
     Min.   :0.3514   Min.   :0.2835   Min.   :0.2713  
     1st Qu.:0.3648   1st Qu.:0.2959   1st Qu.:0.7818  
     Median :0.3785   Median :0.3102   Median :0.8115  
     Mean   :0.4037   Mean   :0.3322   Mean   :0.8309  
     3rd Qu.:0.4419   3rd Qu.:0.3678   3rd Qu.:0.8668  
     Max.   :1.4124   Max.   :0.5896   Max.   :1.4726  

So, for the kind of data I'm working with, the vector-based method is about 20% faster than my original approach. But I'm still thinking that there may be a faster way of dealing with this problem.

Any thoughts from the community?

asked Dec 11 '22 by WinzarH


1 Answer

Here is a complete base solution that relies on table() and is inspired by @chinsoon12.

f_int_AUC <- function(a, b) {
  # tabulate the distinct values and their frequencies in each group
  A <- table(a)
  A_Freq  <- as.vector(A)
  A_alpha <- as.integer(names(A))

  B <- table(b)
  B_Freq <- as.vector(B)
  B_beta <- as.integer(names(B))

  bigger <- 0
  equal  <- 0

  # for each distinct value of a, count the b values it beats or ties
  for (i in seq_along(A_alpha)) {
    bigger <- bigger + sum(A_Freq[i] * B_Freq[A_alpha[i] > B_beta])
    equal  <- equal  + 0.5 * sum(A_Freq[i] * B_Freq[A_alpha[i] == B_beta])
  }

  (bigger + equal) / (length(a) * length(b))
}
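
As a quick sanity check (assuming the alpha and beta vectors generated in the question), the function can be compared directly against the expand.grid result:

# should match PaGTb from the expand.grid approach
f_int_AUC(alpha, beta)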

It's important to use this as a function: on 4,000 rows it is around 8 times faster than the equivalent uncompiled loop.

Unit: milliseconds
              expr    min      lq     mean  median      uq     max neval
   base_int_AUC_fx 1.0187 1.03865 1.146774 1.10930 1.13230  5.3215   100
      base_int_AUC 8.3071 8.43380 9.309865 8.53855 8.88005 40.3327   100
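
One likely reason wrapping the code in a function helps is R's byte compiler. A small illustrative sketch with compiler::cmpfun() (not part of the benchmark above):

library(compiler)

# explicitly byte-compile a copy of the function; since R 3.4 the JIT
# byte-compiles functions on first use, so this is usually automatic
f_int_AUC_cmp <- cmpfun(f_int_AUC)
f_int_AUC_cmp(alpha, beta)   # same result as f_int_AUC(alpha, beta)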

Profiling the function showed that the table() call was slowing this solution down. To address this, tabulate() is now used instead, for a significant speed gain. The downside of tabulate() is that it does not name its bins, so we first need to normalise the values into positive bin indices (h/t to @chinsoon12 for an additional 20% improvement):

f_int_AUC2 <- function(a, b) {
  # tabulate() only counts positive integers, so shift both vectors
  # so that the smallest observed value maps to bin 1
  m <- min(c(a, b))

  A_Freq <- tabulate(a - m + 1)
  B_Freq <- tabulate(b - m + 1)

  # outer product of the two frequency vectors: cell [i, j] is the number
  # of pairs where a == m + i - 1 and b == m + j - 1
  out_matrix <- outer(A_Freq, B_Freq)

  # below the diagonal the a-value exceeds the b-value; on the diagonal they tie
  bigger <- sum(out_matrix[lower.tri(out_matrix)])
  equal  <- 0.5 * sum(diag(out_matrix))

  (bigger + equal) / length(a) / length(b)
}
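
To see why the lower triangle and the diagonal give the counts we want, here is a tiny hand-checkable example (the vectors are arbitrary):

a_small <- c(1L, 2L, 2L)
b_small <- c(1L, 3L)

f_int_AUC2(a_small, b_small)
# 3 x 2 = 6 pairs: a > b in 2 of them (the two 2s beat the 1),
# and the single tie (1 vs 1) counts as 0.5, so the result is 2.5 / 6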

One thing to note: using table() or tabulate() assumes integer (or at least discrete) values and would not work well on floating-point numbers. Based on @Aaron's suggestion, I also looked at the wilcox.test() code and made a few simplifications for this question. Note that I extracted the relevant code from the function: wilcox.test() is 437 lines of code, and only about 4 of them are needed here, which provides a significant speed increase.

f_wilcox_AUC <- function(a, b) {
  # r <- data.table::frank(c(a, b))  # is much faster
  r <- rank(c(a, b))

  n.x <- as.integer(length(a))
  n.y <- as.integer(length(b))

  # rank-sum statistic taken from wilcox.test()'s STATISTIC; I am not a
  # statistician -- see the reference at the end or Google "r wilcox.test code"
  (sum(r[seq_along(a)]) - n.x * (n.x + 1) / 2) / n.x / n.y
}
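
The scaled statistic is the Mann-Whitney U divided by the number of pairs, so (assuming the alpha and beta vectors from earlier) it can be cross-checked against wilcox.test() itself:

# wilcox.test() reports W, the same rank-sum statistic, so the two agree;
# suppressWarnings() only hides the ties / exact-p-value warning
u <- suppressWarnings(wilcox.test(alpha, beta))$statistic
unname(u) / length(alpha) / length(beta)

f_wilcox_AUC(alpha, beta)   # same value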

Performance:

4,000 x 4,000

Unit: microseconds
           expr   min    lq  mean median    uq   max neval
 bigstats_prive   719   748   792    797   817   884    10
    chinsoon_DT  5910  6050  6280   6210  6350  7180    10
   base_int_AUC  1060  1070  2660   1130  1290 16300    10
  base_int_AUC2   237   250  1430    256   266 12000    10
    wilcox_base  1050  1050  1830   1060  1070  8790    10
      wilcox_dt   500   524  2940    530   603 16800    10
    wilcox_test 11300 11400 11900  11500 11700 13400    10

40,000 x 40,000

Unit: milliseconds
           expr    min     lq   mean median     uq    max neval
 bigstats_prive   4.03   4.07   5.13   4.11   4.27  66.40   100
    chinsoon_DT   6.62   7.01   7.88   7.44   7.75  19.20   100
   base_int_AUC   4.53   4.60   5.96   4.68   4.81  93.40   100
  base_int_AUC2   1.03   1.07   1.14   1.08   1.12   3.70   100
    wilcox_base  13.70  13.80  14.10  13.80  14.00  26.50   100
      wilcox_dt   1.87   2.00   2.38   2.11   2.23   6.25   100
    wilcox_test 115.00 118.00 121.00 121.00 124.00 135.00   100

400,000 x 400,000

Unit: milliseconds
           expr    min     lq   mean median     uq    max neval
 bigstats_prive   50.3   54.0   55.0   54.7   56.8   58.6    10
    chinsoon_DT   19.8   20.9   22.6   22.7   23.7   27.3    10
   base_int_AUC   43.6   45.3   53.3   47.8   49.6  108.0    10
  base_int_AUC2   10.0   10.4   11.8   10.7   13.8   14.8    10
    wilcox_base  252.0  254.0  271.0  258.0  293.0  316.0    10
      wilcox_dt   19.4   20.8   22.1   22.6   23.6   24.2    10
    wilcox_test 1440.0 1460.0 1480.0 1480.0 1500.0 1520.0    10

In all, the base function f_int_AUC2() takes about 210 ms to do 1,000 replications of 4,000 x 4,000 comparisons:

# A tibble: 7 x 13
  expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory             time     gc                  
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>             <list>   <list>              
1 bigstats_prive  680.9us  692.8us    1425.   410.76KB     8.60   994     6   697.45ms <dbl [1]> <df[,3] [12 x 3]>  <bch:tm> <tibble [1,000 x 3]>
2 chinsoon_DT      5.55ms   6.02ms     166.   539.92KB     4.78   972    28      5.85s <dbl [1]> <df[,3] [82 x 3]>  <bch:tm> <tibble [1,000 x 3]>
3 base_int_AUC    981.8us   1.02ms     958.   750.44KB    10.7    989    11      1.03s <dbl [1]> <df[,3] [606 x 3]> <bch:tm> <tibble [1,000 x 3]>
4 base_int_AUC2   203.7us  209.3us    4743.   454.36KB    33.4    993     7   209.38ms <dbl [1]> <df[,3] [22 x 3]>  <bch:tm> <tibble [1,000 x 3]>
5 wilcox_base      1.03ms   1.04ms     959.   359.91KB     4.82   995     5      1.04s <dbl [1]> <df[,3] [11 x 3]>  <bch:tm> <tibble [1,000 x 3]>
6 wilcox_dt       410.4us  444.5us    2216.   189.09KB     6.67   997     3   450.01ms <dbl [1]> <df[,3] [9 x 3]>   <bch:tm> <tibble [1,000 x 3]>
7 wilcox_test     11.35ms  11.44ms      85.6    1.16MB     1.66   981    19     11.45s <dbl [1]> <df[,3] [58 x 3]>  <bch:tm> <tibble [1,000 x 3]>
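
The table above is bench::mark() output. A comparable run for just the two base functions might look like the sketch below; the exact setup used for the original table isn't shown in the post, so this is only an approximation (set.seed and the 4,000-row vectors are added here for reproducibility):

library(bench)

set.seed(42)
alpha <- as.integer(runif(4000, min = 0, max = 100))
beta  <- as.integer(runif(4000, min = 0, max = 100))

bench::mark(
  base_int_AUC  = f_int_AUC(alpha, beta),
  base_int_AUC2 = f_int_AUC2(alpha, beta),
  iterations = 1000
)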

Edit: see the edit history for earlier, less efficient methods using outer() and data.table::CJ().

References: https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/wilcox.test.R

answered Jan 19 '23 by Cole