I have a n x m matrix mat and a vector vec of length m. Is there a fast way to subtract this vector from each row of the matrix, recycling it as necessary.
Example:
mat = structure(c(3.01, 1.44, 3.31, 1.34, 3.79, 1.65, 3.06, 1.12, 2.34,
0.27, 2.63, 0.63, 2.73, 0.94, 3.1, 1.34, 2.75, 0.75, 2.83, 0.58
), .Dim = c(2L, 10L))
vec = colMeans(mat)
Whats the fastest way to subtract vec from each row of mat? aperm seems very inefficient.
aperm(aperm(mat) - vec)
Allocating a second matrix doesn't seem too spiffy either.
mat - matrix(vec, ncol=ncol(mat), nrow = nrow(mat), by.row =T)
Note: This question is a repost of a previous question closed as a duplicate. Unfortunately the duplicates lacked relevant, thorough answers so I decided to add it along with an answer.
library(microbenchmark)
byrow.speed.benchmark = function(ncol, nrow) {
mat = matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol)
vec = colSums(mat)
microbenchmark(
aperm(aperm(mat) - vec),
t(t(mat) - vec),
mat - matrix(vec, ncol=ncol(mat), nrow = nrow(mat), byrow =T),
sweep(mat, 2, vec),
mat - rep(vec, each = nrow(mat)),
#mat %*% diag(vec),
mat - vec[col(mat)],
mat - vec,
times = 300
)
}
byrow.speed.benchmark(10, 10)
Comparing several methods of applying across matrix rows we find that allocating a vector is the fastest.
Unit: nanoseconds
expr min lq mean median uq max neval
aperm(aperm(mat) - vec) 8642 9283 10214.287 9923 10243 80344 300
t(t(mat) - vec) 6722 7362 7950.130 8002 8323 27208 300
mat - matrix(vec, ncol = ncol(mat), nrow = nrow(mat), byrow = T) 3201 3841 4282.947 4161 4482 20486 300
sweep(mat, 2, vec) 26888 28489 30016.310 29448 30089 85145 300
mat - rep(vec, each = nrow(mat)) 2560 3201 3481.630 3521 3841 10883 300
mat - vec[col(mat)] 1600 2241 2594.970 2561 2881 6081 300
mat - vec 0 320 389.530 320 321 1921 300
How does this scale?
ncols = floor(10^((4:12)/4))
nrows = floor(10^((4:12)/4))
results = cbind(expand.grid(ncols, nrows), aperm = NA, t=NA, alloc = NA, sweep = NA, rep = NA, indices=NA, control = NA)
for (i in seq(nrow(results))) {
df = byrow.speed.benchmark(results[i,1], results[i,2])
results[i,3:9] = sapply(split(df$time, as.numeric(df$expr)), mean)
}
library(ggplot2)
df = reshape2::melt(results, id.vars= c("Var1", "Var2"))
colnames(df) = c("ncol", "nrow", "method", "meantime")
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + ggtitle("Scaling with cell number.") + coord_cartesian(ylim = c(0, 1E6))
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + ggtitle("Scaling with cell number.") #+ coord_cartesian(ylim = c(0, 5E7))
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + coord_cartesian(ylim = c(0, 3E7)) + ggtitle("Scaling with a wide matrix (1000 columns)")
ggplot(subset(df, nrow==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + coord_cartesian(ylim = c(0, 3E7)) + ggtitle("Scaling with a tall matrix (1000 rows)")
The pink line is the case where we apply the vector over the columns with built in recycling. Allocating a matrix with matrix(vec, byrow=T) scales the best of our options.

On the off chance that the matrix dimensions affected this here is scaling for a wide and a tall matrix.

Edit: It's worth noting that (as expected) the matrix allocation does not scale as well as vector recycling. The above plots are slightly misleading in that regard.

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With