Colleagues,
I want to estimate the likelihood that a randomly selected item from one group will have a higher score than a randomly selected item from a different group. That is the Probability of Superiority, sometimes called the Common-Language Effect Size (see, for example, https://rpsychologist.com/d3/cohend/). This can be solved algebraically if we accept that the data are normally distributed (McGraw and Wong, 1992, Psychological Bulletin, 111), but I know that my data are not normally distributed, making such an estimate unreliable.
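For reference, under the normality assumption the McGraw and Wong estimate reduces to a one-liner. A minimal sketch (the function name cles_normal is just illustrative, using sample means and variances as plug-in estimates):

# Probability of Superiority assuming A and B are normally distributed:
# P(A > B) = pnorm( (mu_A - mu_B) / sqrt(var_A + var_B) )
cles_normal <- function(a, b) {
  pnorm((mean(a) - mean(b)) / sqrt(var(a) + var(b)))
}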
My solution is simple brute force. Using sample data, I compare each observation in one group with each observation in the other group, counting the frequency with which a > b and giving half credit to ties.
My first attempt was a for/if/else loop, and looked like this:
# Generate sample data
alpha <- as.integer(runif(2000, min = 0, max = 100))
beta  <- as.integer(runif(2000, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)

# Initialise counters
bigger  <- 0
equal   <- 0
smaller <- 0

# Compare every observation in alpha with every observation in beta
for (i in alpha) {
  for (j in beta) {
    if (i > j) {
      bigger <- bigger + 1
    } else if (i < j) {
      smaller <- smaller + 1
    } else {
      equal <- equal + 0.5
    }
  }
}

# Record Probability a > b (ties counted as half)
PaGTb <- (bigger + equal) / nPairs
This does the job, but it's woefully slow, especially in a Shiny app.
A work colleague reminded me that R is vectorised and suggested the expand.grid() function, so my second attempt looks like this:
# Generate sample data
alpha <- as.integer(runif(2000, min = 0, max = 100))
beta  <- as.integer(runif(2000, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)

# Pair each observation in alpha with each observation in beta
pairs <- expand.grid(a = alpha, b = beta)

# Count frequency of a > b, giving half credit to ties
aGTb <- length(which(pairs$a > pairs$b))
aEQb <- length(which(pairs$a == pairs$b))
aGTb <- aGTb + 0.5 * aEQb

# Record Probability a > b
PaGTb <- aGTb / nPairs
The vectorised version is noticeably faster, but not dramatically so. With simulations, I found that the vector-based method is faster for many paired comparisons (millions), while the for/if/else method is faster for relatively few comparisons (thousands). The following table summarises 1,000 replications of 3334 * 3000 = 10,002,000 paired comparisons on my iMac:
 time_ForIfElse    time_Vector     Ratio_Vector/ForIfElse
 Min.   :0.3514   Min.   :0.2835   Min.   :0.2713
 1st Qu.:0.3648   1st Qu.:0.2959   1st Qu.:0.7818
 Median :0.3785   Median :0.3102   Median :0.8115
 Mean   :0.4037   Mean   :0.3322   Mean   :0.8309
 3rd Qu.:0.4419   3rd Qu.:0.3678   3rd Qu.:0.8668
 Max.   :1.4124   Max.   :0.5896   Max.   :1.4726
So, for the kind of data I'm working with, the vector-based method is about 20% faster than my original approach. But I'm still thinking that there may be a faster way of dealing with this problem.
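For completeness, a more compact vectorised variant (a sketch only, not part of my benchmarks) avoids building the expand.grid data frame, although it still allocates two length(alpha)-by-length(beta) logical matrices:

# Same count via outer(): proportion of pairs with a > b, plus half the ties
PaGTb_outer <- mean(outer(alpha, beta, ">")) + 0.5 * mean(outer(alpha, beta, "=="))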
Any thoughts from the community?
Here is a complete base solution that relies on table() and is inspired by @chinsoon12.
f_int_AUC <- function(a, b) {
  # Frequency tables of the distinct values in each group
  A <- table(a)
  A_Freq  <- as.vector(A)
  A_alpha <- as.integer(names(A))

  B <- table(b)
  B_Freq <- as.vector(B)
  B_beta <- as.integer(names(B))

  bigger <- 0
  equal  <- 0
  # Compare distinct values only, weighting by their frequencies
  for (i in seq_along(A_alpha)) {
    bigger <- bigger + sum(A_Freq[i] * B_Freq[A_alpha[i] > B_beta])
    equal  <- equal  + 0.5 * sum(A_Freq[i] * B_Freq[A_alpha[i] == B_beta])
  }
  (bigger + equal) / (length(a) * length(b))
}
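As a quick illustrative check (sample data along the lines of the question, not part of the benchmarks below):

set.seed(42)                                     # illustrative only
a <- as.integer(runif(4000, min = 0, max = 100))
b <- as.integer(runif(4000, min = 0, max = 100))
f_int_AUC(a, b)   # same value the brute-force loop would give on these data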
It's important to use this as a function: because R byte-compiles function bodies, it is around 8 times faster on 4,000 rows than the same code run as an uncompiled top-level loop.
Unit: milliseconds
            expr    min      lq     mean  median      uq     max neval
 base_int_AUC_fx 1.0187 1.03865 1.146774 1.10930 1.13230  5.3215   100
    base_int_AUC 8.3071 8.43380 9.309865 8.53855 8.88005 40.3327   100
Profiling of the function showed that the table() call was slowing this solution down. To address it, tabulate() is now used instead, for significant speed gains. The downside of tabulate() is that it does not name its bins, so we first need to normalise the values to positive integer bins (h/t to @chinsoon12 for an additional 20% improvement):
f_int_AUC2 <- function(a, b) {
  # tabulate() only counts positive integers, so shift the values
  # to start at 1 before binning.
  m <- min(c(a, b))
  A_Freq <- tabulate(a - m + 1)
  B_Freq <- tabulate(b - m + 1)

  # Outer product of the two frequency vectors: cell [i, j] holds the
  # number of pairs with a == m + i - 1 and b == m + j - 1.
  out_matrix <- outer(A_Freq, B_Freq)

  # a > b below the diagonal, a == b on the diagonal
  bigger <- sum(out_matrix[lower.tri(out_matrix)])
  equal  <- 0.5 * sum(diag(out_matrix))

  (bigger + equal) / length(a) / length(b)
}
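Because the pairwise work now scales with the number of distinct values (bins) rather than the number of observation pairs, this is much cheaper than comparing raw vectors. As an illustrative check (assuming the a and b samples from the check above), the two base versions should agree on integer data:

# Both base versions assume integer-valued input
all.equal(f_int_AUC(a, b), f_int_AUC2(a, b))   # expected: TRUE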
One thing to note: using table() or tabulate() assumes integer-valued data and will not work well on floating-point numbers. Based on @Aaron's suggestion, I looked at the wilcox.test code and made a few simplifications tailored to the question. Note that I extracted the code from the function: wilcox.test() is 437 lines of code, and these are only 4 of them, which provides a significant speed increase.
f_wilcox_AUC <- function(a, b) {
  # r <- data.table::frank(c(a, b))  # is much faster
  r <- rank(c(a, b))

  n.x <- as.integer(length(a))
  n.y <- as.integer(length(b))

  # From wilcox.test STATISTIC; I am not a statistician and do not follow it fully.
  # See the reference at the end or Google "r wilcox.test code".
  (sum(r[seq_along(a)]) - n.x * (n.x + 1) / 2) / n.x / n.y
}
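An illustrative check: with rank()'s default ties.method = "average", ties receive exactly the 0.5 credit used in the counting versions, and unlike the table()/tabulate() approaches this also works for floating-point data (again assuming the a and b samples above):

all.equal(f_wilcox_AUC(a, b), f_int_AUC(a, b))       # integer data: TRUE
f_wilcox_AUC(rnorm(1000), rnorm(1000, mean = 0.3))   # floating point is fine too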
Performance:

4,000 x 4,000
Unit: microseconds
           expr   min    lq  mean median    uq   max neval
 bigstats_prive   719   748   792    797   817   884    10
    chinsoon_DT  5910  6050  6280   6210  6350  7180    10
   base_int_AUC  1060  1070  2660   1130  1290 16300    10
  base_int_AUC2   237   250  1430    256   266 12000    10
    wilcox_base  1050  1050  1830   1060  1070  8790    10
      wilcox_dt   500   524  2940    530   603 16800    10
    wilcox_test 11300 11400 11900  11500 11700 13400    10

40,000 x 40,000
Unit: milliseconds
           expr    min     lq   mean median     uq    max neval
 bigstats_prive   4.03   4.07   5.13   4.11   4.27  66.40   100
    chinsoon_DT   6.62   7.01   7.88   7.44   7.75  19.20   100
   base_int_AUC   4.53   4.60   5.96   4.68   4.81  93.40   100
  base_int_AUC2   1.03   1.07   1.14   1.08   1.12   3.70   100
    wilcox_base  13.70  13.80  14.10  13.80  14.00  26.50   100
      wilcox_dt   1.87   2.00   2.38   2.11   2.23   6.25   100
    wilcox_test 115.00 118.00 121.00 121.00 124.00 135.00   100

400,000 x 400,000
Unit: milliseconds
           expr     min      lq    mean  median      uq     max neval
 bigstats_prive    50.3    54.0    55.0    54.7    56.8    58.6    10
    chinsoon_DT    19.8    20.9    22.6    22.7    23.7    27.3    10
   base_int_AUC    43.6    45.3    53.3    47.8    49.6   108.0    10
  base_int_AUC2    10.0    10.4    11.8    10.7    13.8    14.8    10
    wilcox_base   252.0   254.0   271.0   258.0   293.0   316.0    10
      wilcox_dt    19.4    20.8    22.1    22.6    23.6    24.2    10
    wilcox_test  1440.0  1460.0  1480.0  1480.0  1500.0  1520.0    10
In all, the fastest base function (base_int_AUC2) takes about 210 ms to do 1,000 replications of a 4,000 x 4,000 comparison:
# A tibble: 7 x 13
  expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory             time     gc
  <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>             <list>   <list>
1 bigstats_prive  680.9us  692.8us     1425.  410.76KB     8.60   994     6   697.45ms <dbl [1]> <df[,3] [12 x 3]>  <bch:tm> <tibble [1,000 x 3]>
2 chinsoon_DT      5.55ms   6.02ms      166.  539.92KB     4.78   972    28      5.85s <dbl [1]> <df[,3] [82 x 3]>  <bch:tm> <tibble [1,000 x 3]>
3 base_int_AUC    981.8us   1.02ms      958.  750.44KB    10.7    989    11      1.03s <dbl [1]> <df[,3] [606 x 3]> <bch:tm> <tibble [1,000 x 3]>
4 base_int_AUC2   203.7us  209.3us     4743.  454.36KB    33.4    993     7   209.38ms <dbl [1]> <df[,3] [22 x 3]>  <bch:tm> <tibble [1,000 x 3]>
5 wilcox_base      1.03ms   1.04ms      959.  359.91KB     4.82   995     5      1.04s <dbl [1]> <df[,3] [11 x 3]>  <bch:tm> <tibble [1,000 x 3]>
6 wilcox_dt       410.4us  444.5us     2216.  189.09KB     6.67   997     3   450.01ms <dbl [1]> <df[,3] [9 x 3]>   <bch:tm> <tibble [1,000 x 3]>
7 wilcox_test     11.35ms  11.44ms      85.6    1.16MB     1.66   981    19     11.45s <dbl [1]> <df[,3] [58 x 3]>  <bch:tm> <tibble [1,000 x 3]>
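For reference, a minimal sketch of how such a comparison can be set up with bench::mark(), limited to the three functions defined in this answer (the other entries in the tables above come from competing answers or variants not reproduced here):

library(bench)

a <- as.integer(runif(4000, min = 0, max = 100))
b <- as.integer(runif(4000, min = 0, max = 100))

bench::mark(
  base_int_AUC  = f_int_AUC(a, b),
  base_int_AUC2 = f_int_AUC2(a, b),
  wilcox_base   = f_wilcox_AUC(a, b),
  iterations    = 1000
)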
Edit: see previous revisions of this answer for less efficient methods using outer() and data.table::CJ().
Reference: https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/wilcox.test.R