I have a matrix (mat) with dimensions 13 x 20,000,000 and the following groups:
[1,] "wildtype"
[2,] "wildtype"
[3,] "wildtype"
[4,] "wildtype"
[5,] "wildtype"
[6,] "wildtype"
[7,] "wildtype"
[8,] "wildtype"
[9,] "wildtype"
[10,] "wildtype"
[11,] "mutant"
[12,] "mutant"
[13,] "mutant"
With the following R code, I effectively fit a model to each of the 20M data points (columns). The lm(mat ~ groups) call itself is really quick; what's taking a long time is extracting the p-value for each model using summary(lm1). How might I speed up extracting the p-values?
tvals_out <- '/tmp/tvals_lm.csv'
infile <- '/tmp/tempdata.dat'

# read the binary input: two integers giving the dimensions, then the data
con <- file(infile, "rb")
dim <- readBin(con, "integer", 2)
mat <- matrix(readBin(con, "numeric", prod(dim)), dim[1], dim[2])
close(con)

groups <- factor(c(rep('wt', 10), rep('mut', 3)))
lm1 <- lm(mat ~ groups)

# This is the longest-running bit
sum_lm1 <- summary(lm1)

# pull the p-value of the group coefficient from each per-response summary
result_pvalues <- vapply(sum_lm1, function(x) x$coefficients[2, 4], FUN.VALUE = 1)

# note: write.table() and writeBin() target the same path, so the binary
# write below overwrites the CSV; keep whichever format you actually need
write.table(result_pvalues, tvals_out, sep = ',')
outCon <- file(tvals_out, "wb")
writeBin(result_pvalues, outCon)
close(outCon)
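For context on why summary() is the slow step: on a multi-response fit it returns one full summary.lm object per column, so it computes standard errors, R-squared, an F statistic and more 20 million times just to reach a single p-value each. A quick way to inspect that, using sum_lm1 from the script above:

length(sum_lm1)                    # one summary per column of mat
str(sum_lm1[[1]], max.level = 1)   # everything computed for each model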
Edit: I've added a sample of 10 data points (columns) from the mat object:
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38)
mat <- matrix(m, nrow=13)
The following function is able to extract p-values from a fit on a 13x20,000,000 matrix (like yours) in about 25 seconds.
pvalOnly2 <- function(fit) {
  # get estimates (rows reordered to match the QR pivot)
  est <- fit$coefficients[fit$qr$pivot, ]
  # get R: see stats:::summary.lm to see how this is calculated
  p1 <- 1L:(fit$rank)
  R <- diag(chol2inv(fit$qr$qr[p1, p1, drop = FALSE]))
  # get the residual variance for each model
  resvar <- colSums(fit$residuals^2) / fit$df.residual
  # R is the same across all models; resvar is the same within each model
  se <- sqrt(outer(R, resvar))
  # two-sided p-value from the t statistic
  pt(abs(est / se), df = fit$df.residual, lower.tail = FALSE) * 2
}
This calculates the same p-values as calling summary (or Benjamin's pvalOnly function). However, it skips all the other steps that summary performs for each model, making it much faster. (Note that Benjamin's pvalOnly calls vcov, which in turn calls summary, which is why it does not save time.)
On a small matrix this is about 30 times faster than summary:
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38)
mat <- matrix(m, nrow=13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)
library(microbenchmark)
microbenchmark(
  summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])),
  pvalOnly2(fit)
)
with results:
Unit: microseconds
expr min lq mean median uq max neval cld
summary 3383.085 3702.238 3978.110 3919.0755 4147.4015 5475.223 100 b
pvalOnly2(fit) 81.538 91.541 136.903 137.1275 157.5535 459.415 100 a
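As a quick sanity check (independent of the timings), the two expressions being benchmarked return the same matrix of p-values, up to dimnames:

pv_summary <- do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4]))
all.equal(unname(pv_summary), unname(pvalOnly2(fit)))   # should be TRUE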
The speed advantage is much larger, however, when there are more models you're fitting. On a matrix of 13x1000, it has about a 300x advantage. And on my machine, when there are 20 million columns, it calculates the p-values in 25 seconds, twice as fast as the fit <- lm(mat ~ groups) step, in fact.
> mat <- mat[, rep(1:10, 2e6)] # just replicating same coefs
> dim(mat)
[1] 13 20000000
> system.time(fit <- lm(mat ~ groups))
user system elapsed
37.272 10.296 58.372
> system.time(pvals <- pvalOnly2(fit))
user system elapsed
21.945 1.889 24.322
The resulting p-values are correct (same as you'd get out of summary):
> dim(pvals)
[1] 2 20000000
> pvals[, 1:10]
[,1] [,2] [,3] [,4] [,5] [,6]
(Intercept) 0.006048267 0.01234835 0.02655251 0.0004555316 0.001004109 0.01608319
groupswildtype 0.129224604 0.22806894 0.88113522 0.2064583345 0.103624361 0.84642990
[,7] [,8] [,9] [,10]
(Intercept) 0.0004630405 0.1386393 0.05107805 5.042796e-05
groupswildtype 0.2717139022 0.1539826 0.66351492 5.942893e-02
(By the way, profiling shows that almost all the running time in the function is spent in the pt function; since that's done in C, this is about as fast as it can be made in any language.)
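You can reproduce that profiling result with base R's Rprof; a minimal sketch, assuming fit is the wide 20M-column fit from above:

Rprof(prof_file <- tempfile())
pvals <- pvalOnly2(fit)
Rprof(NULL)
head(summaryRprof(prof_file)$by.self)   # "pt" should dominate the self time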
In response to your comment, you can also get the per-model p-value (from the F-statistic) with the following function, which runs at a similar speed to pvalOnly2:
modelPvalOnly <- function(fit) {
  f <- t(fit$fitted.values)
  if (attr(fit$terms, "intercept")) {
    # model sum of squares (centred, since there is an intercept)
    mss <- rowSums((f - rowMeans(f))^2)
    numdf <- fit$rank - 1
  } else {
    mss <- rowSums(f^2)
    numdf <- fit$rank
  }
  # residual variance for each model
  resvar <- colSums(fit$residuals^2) / fit$df.residual
  # F statistic and its upper-tail p-value
  fstat <- (mss / numdf) / resvar
  pval <- pf(fstat, numdf, fit$df.residual, lower.tail = FALSE)
  pval
}
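On the small example fit, you can verify it against the F-test p-value stored in each per-response summary (a quick check; too slow to run on all 20 million columns):

pv_f <- sapply(summary(fit), function(s) {
  f <- s$fstatistic
  pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
})
all.equal(unname(pv_f), unname(modelPvalOnly(fit)))   # should be TRUE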
How about giving the broom package a try?
install.packages("broom")
library(broom)
tidy(lm(mat ~ groups))
# response term estimate std.error statistic p.value
# 1 Y1 (Intercept) 27.000000 7.967548 3.3887465 6.048267e-03
# 2 Y1 groupswt 14.900000 9.084402 1.6401740 1.292246e-01
# 3 Y2 (Intercept) 23.333333 7.809797 2.9877004 1.234835e-02
# 4 Y2 groupswt 11.366667 8.904539 1.2765026 2.280689e-01
# 5 Y3 (Intercept) 44.000000 17.192317 2.5592828 2.655251e-02
# ...and more...
Then to extract only the results for groupswt (note: there are various ways to accomplish this):
subset(tidy(lm(mat ~ groups)), term == "groupswt")[, c(1, 6)]
# response p.value
# 2 Y1 0.12922460
# 4 Y2 0.22806894
# 6 Y3 0.88113522
# 8 Y4 0.20645833
# 10 Y5 0.10362436
# 12 Y6 0.84642990
# 14 Y7 0.27171390
# 16 Y8 0.15398258
# 18 Y9 0.66351492
# 20 Y10 0.05942893
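And if all you need is the bare vector of p-values (e.g. to write out with writeBin as in your script), you can index the tidied data frame directly; a small variation on the subset call above:

td <- tidy(lm(mat ~ groups))
result_pvalues <- td$p.value[td$term == "groupswt"]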