I have a matrix in which each row is a sample from a distribution. I want to do a rolling comparison of the distributions using ks.test
and save the test statistic in each case. The simplest way to implement this conceptually is with a loop:
set.seed(1942)
mt <- rbind(rnorm(5), rnorm(5), rnorm(5), rnorm(5))
results <- rep(NA_real_, nrow(mt))
for (i in 2:nrow(mt)) {
  results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
However, my real data has ~400 columns and ~300,000 rows for a single example, and I have a lot of examples. So I'd like this to be fast. The Kolmogorov-Smirnov test isn't all that mathematically complicated, so if the answer is "implement it in Rcpp," I'll grudgingly accept that, but I'd be somewhat surprised -- it's already very fast to compute on a single pair in R.
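For reference, a quick way to confirm the single-pair cost (a and b are stand-in vectors the same length as my rows; timings will vary by machine):

library(microbenchmark)
a <- rnorm(400)
b <- rnorm(400)
microbenchmark(ks.test(a, b)$statistic)  # one pairwise test is cheap on its own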
Methods I've tried but have been unable to get working: dplyr using rowwise/do/lag, zoo using rollapply (which is what I use to generate the distributions), and populating a data.table in a loop (edit: this one works, but it's still slow; a sketch follows below).
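For reference, the data.table version I mean looks roughly like this (a sketch; data.table::set() does low-overhead assignment, and the column name stat is arbitrary):

library(data.table)
dt <- data.table(stat = rep(NA_real_, nrow(mt)))
for (i in 2:nrow(mt)) {
  set(dt, i = i, j = "stat", value = ks.test(mt[i - 1, ], mt[i, ])$statistic)
}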
A quick and dirty implementation in Rcpp:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// Two-sample KS statistic for equal-length samples x and y: walk the
// pooled sample in sorted order, stepping +1 at each observation from x
// and -1 at each observation from y. The running sum is n * (F_x - F_y)
// at each pooled point, so its maximum absolute value over n is D.
double KS(arma::colvec x, arma::colvec y) {
  int n = x.n_rows;
  arma::colvec w = join_cols(x, y);
  arma::uvec z = arma::sort_index(w);             // original indices, in sorted order
  w.fill(-1); w.elem( find(z <= n - 1) ).ones();  // +1 where the value came from x
  return max(abs(cumsum(w))) / n;
}
// [[Rcpp::export]]
Rcpp::NumericVector K_S(arma::mat mt) {
  int n = mt.n_cols;
  Rcpp::NumericVector results(n);  // results[0] is left at 0
  // Rolling comparison: each column against the previous one.
  for (int i = 1; i < n; i++) {
    arma::colvec x = mt.col(i - 1);
    arma::colvec y = mt.col(i);
    results[i] = KS(x, y);
  }
  return results;
}
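Assuming the snippet above is saved in a file, say K_S.cpp (the filename is just for illustration), it can be compiled and loaded with Rcpp::sourceCpp():

library(Rcpp)
sourceCpp("K_S.cpp")  # compiles the C++ code and exports K_S() into the R session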
For a matrix of size (400, 30000), it completes in under 1 s.
system.time(K_S(t(mt)))[3]
#elapsed
# 0.98
And the result seems to be accurate.
set.seed(1942)
mt <- matrix(rnorm(400 * 30000), nrow = 30000)
results <- rep(0, nrow(mt))
for (i in 2:nrow(mt)) {
  results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
result <- K_S(t(mt))
all.equal(result, results)
#[1] TRUE
One source of speed-up is to write a smaller version of ks.test that does less. ks.test2 below is more restrictive than ks.test. It assumes, for example, that you have no missing values and that you always want the statistic associated with a two-sided test.
ks.test2 <- function(x, y){
  n.x <- length(x)
  n.y <- length(y)
  w <- c(x, y)
  # Step +1/n.x at each x observation and -1/n.y at each y observation,
  # taken in pooled sorted order; the running sum is F_x - F_y.
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  max(abs(z))
}
Verify that the output is consistent with ks.test.
set.seed(999)
x <- rnorm(400)
y <- rnorm(400)
ks.test(x, y)$statistic
D
0.045
ks.test2(x, y)
[1] 0.045
Now determine the savings from the smaller function:
library(microbenchmark)
microbenchmark(
  ks.test(x, y),
  ks.test2(x, y)
)
Unit: microseconds
expr min lq mean median uq max neval cld
ks.test(x, y) 1030.238 1070.303 1347.3296 1227.207 1313.8490 6338.918 100 b
ks.test2(x, y) 709.719 730.048 832.9532 833.861 888.5305 1281.284 100 a
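To use ks.test2 for the rolling comparison itself, one option is to drop it into the question's loop structure, for example with vapply() (a sketch; mt stands for the question's sample matrix, one sample per row):

results2 <- c(NA, vapply(2:nrow(mt),
                         function(i) ks.test2(mt[i - 1, ], mt[i, ]),
                         numeric(1)))  # leading NA aligns with the loop's indexing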
I was able to compute the pairwise Kolmogorov-Smirnov statistic using ks.test() with rollapplyr() from zoo.
library(zoo)

results <- rollapplyr(data = big,
                      width = 2,
                      FUN = function(x) ks.test(x[1, ], x[2, ])$statistic,
                      by.column = FALSE)
This gets the expected result, but it's slow for a dataset of your size. Slow slow slow. This may be because ks.test() is computing a lot more than just the statistic at each iteration; it also gets the p-value and does a lot of error checking.
Indeed, if we simulate a large dataset like so:

# Preallocate in one call rather than growing with cbind() in a loop,
# which copies the whole matrix on every iteration.
big <- matrix(rnorm(300000 * 400), ncol = 400)
The rollapplyr() solution takes a long time; I halted execution after about 2 hours, at which point it had computed nearly all (but not all) of the results. It seems that while rollapplyr() is likely faster than a for loop, it is unlikely to be the best overall solution in terms of performance.
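A natural follow-up, not benchmarked here, would be to swap the slimmer ks.test2 from the answer above into the same rollapplyr() call:

results <- rollapplyr(data = big,
                      width = 2,
                      FUN = function(x) ks.test2(x[1, ], x[2, ]),
                      by.column = FALSE)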