If I have a data frame as such:
df = data.frame(matrix(rnorm(5000 * 100), 5000, 100))
I can use the following function to get every combination of three-term medians row-wise:
median_df = t(apply(df, 1, combn, 3, median))
The problem is, this function will take several hours to run. The culprit is median(), which takes about ten times longer to run than max() or min().
How can I speed this function up, possibly by writing a faster version of median() or working with the original data differently?
Update:
If I run the above code on only the first ten columns, df[,1:10], I get the following timings:
median_df = t(apply(df[,1:10], 1, combn, 3, median))
takes 29 seconds;
fastMedian_df = t(apply(df[,1:10], 1, combn, 3, fastMedian))
with fastMedian() from the ccaPP package, takes 6.5 seconds;
max_df = t(apply(df[,1:10], 1, combn, 3, max))
takes 2.5 seconds.
So fastMedian() gives a significant improvement. Can we still do better?
One approach to speed things up would be to note that the median of three numbers is their sum minus their max minus their min. This means we can vectorize our median calculations by handling each triple of columns once (performing the median for all rows in the same calculation) instead of handling it once for each row.
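As a quick sanity check of that identity, here is a minimal sketch on a small random matrix (the names triples, by_identity and by_median are only illustrative). Because the identity goes through a sum and two subtractions, the result can differ from median() by floating-point rounding, so the comparison uses all.equal rather than identical.

```r
# Check median(a, b, c) == a + b + c - min(a, b, c) - max(a, b, c) row by row.
set.seed(1)
triples <- matrix(rnorm(30), ncol = 3)   # 10 rows, 3 values each

# Vectorized over rows: one pmin/pmax call handles every row at once.
by_identity <- rowSums(triples) -
  pmin(triples[, 1], triples[, 2], triples[, 3]) -
  pmax(triples[, 1], triples[, 2], triples[, 3])

# Reference: call median() separately for each row.
by_median <- apply(triples, 1, median)

all.equal(by_identity, by_median)
```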
set.seed(144)
# Fully random matrix
df = matrix(rnorm(50000), 5000, 10)
original <- function(df) t(apply(df, 1, combn, 3, median))
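josilber <- function(df) {
  combos <- combn(seq_len(ncol(df)), 3)  # each column is one triple of column indices
  # Median of three = sum - min - max, computed for all rows at once
  apply(combos, 2, function(x)
    rowSums(df[, x]) - pmin(df[, x[1]], df[, x[2]], df[, x[3]]) -
      pmax(df[, x[1]], df[, x[2]], df[, x[3]]))
}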
system.time(res.josilber <- josilber(df))
# user system elapsed
# 0.117 0.009 0.149
system.time(res.original <- original(df))
# user system elapsed
# 15.107 1.864 16.960
all.equal(res.josilber, res.original)
# [1] TRUE
The vectorization yields a 110x speedup when there are 10 columns and 5000 rows. Unfortunately I do not have a machine with enough memory to store the 808.5 million numbers in the output for your full example.
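The size of the full output can be estimated with a few lines of arithmetic. This is only a back-of-the-envelope sketch that assumes the result is stored as a dense matrix of doubles (8 bytes each); the variable names are illustrative.

```r
n_rows   <- 5000
n_cols   <- 100
n_combos <- choose(n_cols, 3)        # number of column triples: 161700
n_values <- n_rows * n_combos        # 808.5 million doubles in the result
size_gib <- n_values * 8 / 1024^3    # 8 bytes per double, expressed in GiB

c(combos = n_combos, values = n_values, GiB = round(size_gib, 2))
```

That comes to roughly 6 GiB for the result alone, before counting any intermediate copies made along the way.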
You could speed this up further by implementing an Rcpp function that takes as input the vector representation of a matrix (that is, the vector obtained by reading the matrix down the columns) along with the number of rows, and returns the median of each column. The function relies heavily on std::nth_element, which runs in linear time on average in the number of elements you're taking a median of. (Note that I don't average the middle two values when taking the median of an even-length vector; with the index chunkSize/2 used below, I instead take the upper of the two.)
library(Rcpp)
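# Compile a C++ helper: x holds consecutive chunks of length chunkSize,
# and the result holds one selected (median) element per chunk.
cppFunction(
"NumericVector vectorizedMedian(NumericVector x, int chunkSize) {
  const int n = x.size() / chunkSize;
  // Work on a copy so that nth_element does not reorder the R vector in place
  std::vector<double> input = Rcpp::as<std::vector<double> >(x);
  NumericVector res(n);
  for (int i = 0; i < n; ++i) {
    // Partially sort chunk i so that its middle position holds the median
    std::nth_element(input.begin() + i*chunkSize,
                     input.begin() + i*chunkSize + chunkSize/2,
                     input.begin() + (i+1)*chunkSize);
    res[i] = input[i*chunkSize + chunkSize/2];
  }
  return res;
}", includes = "#include <algorithm>")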
Now we just invoke this function instead of using rowSums, pmin, and pmax:
josilber.rcpp <- function(df) {
  combos <- combn(seq_len(ncol(df)), 3)
  # t() puts each row's three values next to each other, so every chunk of 3 is one row
  apply(combos, 2, function(x) vectorizedMedian(as.vector(t(df[, x])), 3))
}
system.time(josilber.rcpp(df))
# user system elapsed
# 0.049 0.008 0.081
all.equal(josilber(df), josilber.rcpp(df))
# [1] TRUE
In total we therefore get a 210x speedup: 110x comes from switching from a non-vectorized application of median to a vectorized one, and the remaining 2x comes from switching from the combination of rowSums, pmin, and pmax to an Rcpp-based approach for computing the median in a vectorized way.
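To see why the Rcpp version transposes before flattening, and what it selects from each chunk, here is a small pure-R sketch. chunk_select_ref is a hypothetical stand-in written only for illustration (it is not part of Rcpp or of the code above), and it assumes the input length is a multiple of chunkSize.

```r
# Small matrix with 2 rows and 3 columns: row 1 is (5, 3, 4), row 2 is (1, 9, 2).
m <- matrix(c(5, 1,
              3, 9,
              4, 2), nrow = 2)

col_major <- as.vector(m)      # 5 1 3 9 4 2: the two rows are interleaved
row_major <- as.vector(t(m))   # 5 3 4 1 9 2: each row's triple is contiguous

# Hypothetical pure-R reference: for each chunk, return the element found at
# 0-based position chunkSize %/% 2 once the chunk is sorted.
chunk_select_ref <- function(x, chunkSize) {
  chunks <- matrix(x, nrow = chunkSize)   # one chunk per column
  apply(chunks, 2, function(v) sort(v)[chunkSize %/% 2 + 1])
}

row_medians <- chunk_select_ref(row_major, 3)   # 4 2
base_medians <- apply(m, 1, median)             # 4 2

# With an even chunk size the selected element is the upper of the two
# middle values, whereas median() averages them.
even_pick <- chunk_select_ref(c(1, 2, 3, 4), 4) # 3
even_base <- median(c(1, 2, 3, 4))              # 2.5
```

For chunks of three, as used here, the two approaches agree; the difference only matters if the same helper is reused with an even chunk size.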