I have seen pairwise or general paired simple linear regression many times on Stack Overflow. Here is a toy dataset for this kind of problem. <pre class="prettyprint"><code>set.seed(0) X <- matrix(runif(100), 100, 5, dimnames = list(1:100, LETTERS[1:5])) b <- c(1, 0.7, 1.3, 2.9, -2) dat <- X * b[col(X)] + matrix(rnorm(100 * 5, 0, 0.1), 100, 5) dat <- as.data.frame(dat) pairs(dat) </code></pre> <img src="https://i.stack.imgur.com/5csaI.jpg" alt=""> So basically we want to compute 5 * 4 = 20 regression lines: <pre class="prettyprint"><code>----- A ~ B A ~ C A ~ D A ~ E B ~ A ----- B ~ C B ~ D B ~ E C ~ A C ~ B ----- C ~ D C ~ E D ~ A D ~ B D ~ C ----- D ~ E E ~ A E ~ B E ~ C E ~ D ----- </code></pre> Here is a poor man's strategy: <pre class="prettyprint"><code>poor <- function (dat) { n <- nrow(dat) p <- ncol(dat) ## all formulae LHS <- rep(colnames(dat), p) RHS <- rep(colnames(dat), each = p) ## function to fit and summarize a single model fitmodel <- function (LHS, RHS) { if (RHS == LHS) { z <- data.frame("LHS" = LHS, "RHS" = RHS, "alpha" = 0, "beta" = 1, "beta.se" = 0, "beta.tv" = Inf, "beta.pv" = 0, "sig" = 0, "R2" = 1, "F.fv" = Inf, "F.pv" = 0, stringsAsFactors = FALSE) } else { result <- summary(lm(reformulate(RHS, LHS), data = dat)) z <- data.frame("LHS" = LHS, "RHS" = RHS, "alpha" = result$coefficients[1, 1], "beta" = result$coefficients[2, 1], "beta.se" = result$coefficients[2, 2], "beta.tv" = result$coefficients[2, 3], "beta.pv" = result$coefficients[2, 4], "sig" = result$sigma, "R2" = result$r.squared, "F.fv" = result$fstatistic[[1]], "F.pv" = pf(result$fstatistic[[1]], 1, n - 2, lower.tail = FALSE), stringsAsFactors = FALSE) } z } ## loop through all models do.call("rbind.data.frame", c(Map(fitmodel, LHS, RHS), list(make.row.names = FALSE, stringsAsFactors = FALSE))) } </code></pre> The logic is clear: get all pairs, construct the model formula (<code>reformulate</code>), fit a regression (<code>lm</code>), do a summary <code>summary</code>, return all statistics and <code>rbind</code> them to be a data frame. OK, fine, but what if there are <code>p</code> variables? We then need to do <code>p * (p - 1)</code> regressions! An immediate improvement I could think of, is Fitting a linear model with multiple LHS. For example, the first column of that formula matrix is merged to <pre class="prettyprint"><code>cbind(B, C, D, E) ~ A </code></pre> This reduces the number of regression from <code>p * (p - 1)</code> to <code>p</code>. But we can definitely do even better without using <code>lm</code> and <code>summary</code>. Here is my previous attempt: Is there a fast estimation of simple regression (a regression line with only intercept and slope)?. It is fast because it uses covariance between variables for estimation, like solving the normal equation. But the <code>simpleLM</code> function there is pretty limited: <ol> <li>it needs to compute residual vectors to estimate residual standard error, which is a performance bottleneck;</li> <li>it doesn't support multiple LHS, so it needs be called <code>p * (p - 1)</code> times in pairwise regression settings).</li> </ol> Can we generalize it for fast pairwise regression, by writing a function <code>pairwise_simpleLM</code>? <hr> <h3>General paired simple linear regression</h3> A more useful variation of the above pairwise regression is the general paired regression between a set of LHS variables and a set of RHS variables. Example 1 Fit paired regression between LHS variables <code>A</code>, <code>B</code>, <code>C</code> and RHS variables <code>D</code>, <code>E</code>, that is, fit 6 simple linear regression lines: <pre class="prettyprint"><code>A ~ D A ~ E B ~ D B ~ E C ~ D C ~ E </code></pre> Example 2 Fit a simple linear regression with multiple LHS variables to a particular RHS variable, say: <code>cbind(A, B, C, D) ~ E</code>. Example 3 Fit a simple linear regression with a particular LHS variable, and a set of RHS variables one at a time, for example: <pre class="prettyprint"><code>A ~ B A ~ C A ~ D A ~ E </code></pre> Can we also have a fast function <code>general_paired_simpleLM</code> for this? <hr> Caution <ol> <li>All variables must be numeric; factors are not allowed or pairwise regression makes no sense.</li> <li>Weighted regression is not discussed, as variance-covariance method is not justified in that case.</li> </ol>

<h3>Some statistical result / background</h3> <img src="https://i.stack.imgur.com/BbaGA.png" alt=""> (Link in the picture: Function to calculate R2 (R-squared) in R) <hr> <h3>Computational details</h3> Computations involved here is basically the computation of the variance-covariance matrix. Once we have it, results for all pairwise regression is just element-wise matrix arithmetic. The variance-covariance matrix can be obtained by R function <code>cov</code>, but functions below compute it manually using <code>crossprod</code>. The advantage is that it can obviously benefit from an optimized BLAS library if you have it. Be aware that significant amount of simplification is made in this way. R function <code>cov</code> has argument <code>use</code> which allows handling <code>NA</code>, but <code>crossprod</code> does not. I am assuming that your <code>dat</code> has no missing values at all! If you do have missing values, remove them yourself with <code>na.omit(dat)</code>. The initial <code>as.matrix</code> that converts a data frame to a matrix might be an overhead. In principle if I code everything up in C / C++, I can eliminate this coercion. And in fact, many element-wise matrix matrix arithmetic can be merged into a single loop-nest. However, I really bother doing this at the moment (as I have no time). Some people may argue that the format of the final return is inconvenient. There could be other format: <ol> <li>a list of data frames, each giving the result of the regression for a particular LHS variable;</li> <li>a list of data frames, each giving the result of the regression for a particular RHS variable.</li> </ol> This is really opinion-based. Anyway, you can always do a <code>split.data.frame</code> by "LHS" column or "RHS" column yourself on the data frame I return you. <hr> <h3>R function <code>pairwise_simpleLM</code> </h3> <pre class="prettyprint"><code>pairwise_simpleLM <- function (dat) { ## matrix and its dimension (n: numbeta.ser of data; p: numbeta.ser of variables) dat <- as.matrix(dat) n <- nrow(dat) p <- ncol(dat) ## variable summary: mean, (unscaled) covariance and (unscaled) variance m <- colMeans(dat) V <- crossprod(dat) - tcrossprod(m * sqrt(n)) d <- diag(V) ## R-squared (explained variance) and its complement R2 <- (V ^ 2) * tcrossprod(1 / d) R2_complement <- 1 - R2 R2_complement[seq.int(from = 1, by = p + 1, length = p)] <- 0 ## slope and intercept beta <- V * rep(1 / d, each = p) alpha <- m - beta * rep(m, each = p) ## residual sum of squares and standard error RSS <- R2_complement * d sig <- sqrt(RSS * (1 / (n - 2))) ## statistics for slope beta.se <- sig * rep(1 / sqrt(d), each = p) beta.tv <- beta / beta.se beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE) ## F-statistic and p-value F.fv <- (n - 2) * R2 / R2_complement F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE) ## export data.frame(LHS = rep(colnames(dat), times = p), RHS = rep(colnames(dat), each = p), alpha = c(alpha), beta = c(beta), beta.se = c(beta.se), beta.tv = c(beta.tv), beta.pv = c(beta.pv), sig = c(sig), R2 = c(R2), F.fv = c(F.fv), F.pv = c(F.pv), stringsAsFactors = FALSE) } </code></pre> Let's compare the result on the toy dataset in the question. <pre class="prettyprint"><code>oo <- poor(dat) rr <- pairwise_simpleLM(dat) all.equal(oo, rr) #[1] TRUE </code></pre> Let's see its output: <pre class="prettyprint"><code>rr[1:3, ] # LHS RHS alpha beta beta.se beta.tv beta.pv sig #1 A A 0.00000000 1.0000000 0.00000000 Inf 0.000000e+00 0.0000000 #2 B A 0.05550367 0.6206434 0.04456744 13.92594 5.796437e-25 0.1252402 #3 C A 0.05809455 1.2215173 0.04790027 25.50126 4.731618e-45 0.1346059 # R2 F.fv F.pv #1 1.0000000 Inf 0.000000e+00 #2 0.6643051 193.9317 5.796437e-25 #3 0.8690390 650.3142 4.731618e-45 </code></pre> When we have the same LHS and RHS, regression is meaningless hence intercept is 0, slope is 1, etc. What about speed? Still using this toy example: <pre class="prettyprint"><code>library(microbenchmark) microbenchmark("poor_man's" = poor(dat), "fast" = pairwise_simpleLM(dat)) #Unit: milliseconds # expr min lq mean median uq max # poor_man's 127.270928 129.060515 137.813875 133.390722 139.029912 216.24995 # fast 2.732184 3.025217 3.381613 3.134832 3.313079 10.48108 </code></pre> The gap is going be increasingly wider as we have more variables. For example, with 10 variables we have: <pre class="prettyprint"><code>set.seed(0) X <- matrix(runif(100), 100, 10, dimnames = list(1:100, LETTERS[1:10])) b <- runif(10) DAT <- X * b[col(X)] + matrix(rnorm(100 * 10, 0, 0.1), 100, 10) DAT <- as.data.frame(DAT) microbenchmark("poor_man's" = poor(DAT), "fast" = pairwise_simpleLM(DAT)) #Unit: milliseconds # expr min lq mean median uq max # poor_man's 548.949161 551.746631 573.009665 556.307448 564.28355 801.645501 # fast 3.365772 3.578448 3.721131 3.621229 3.77749 6.791786 </code></pre> <hr> <h3>R function <code>general_paired_simpleLM</code> </h3> <pre class="prettyprint"><code>general_paired_simpleLM <- function (dat_LHS, dat_RHS) { ## matrix and its dimension (n: numbeta.ser of data; p: numbeta.ser of variables) dat_LHS <- as.matrix(dat_LHS) dat_RHS <- as.matrix(dat_RHS) if (nrow(dat_LHS) != nrow(dat_RHS)) stop("'dat_LHS' and 'dat_RHS' don't have same number of rows!") n <- nrow(dat_LHS) pl <- ncol(dat_LHS) pr <- ncol(dat_RHS) ## variable summary: mean, (unscaled) covariance and (unscaled) variance ml <- colMeans(dat_LHS) mr <- colMeans(dat_RHS) vl <- colSums(dat_LHS ^ 2) - ml * ml * n vr <- colSums(dat_RHS ^ 2) - mr * mr * n ##V <- crossprod(dat - rep(m, each = n)) ## cov(u, v) = E[(u - E[u])(v - E[v])] V <- crossprod(dat_LHS, dat_RHS) - tcrossprod(ml * sqrt(n), mr * sqrt(n)) ## cov(u, v) = E[uv] - E{u]E[v] ## R-squared (explained variance) and its complement R2 <- (V ^ 2) * tcrossprod(1 / vl, 1 / vr) R2_complement <- 1 - R2 ## slope and intercept beta <- V * rep(1 / vr, each = pl) alpha <- ml - beta * rep(mr, each = pl) ## residual sum of squares and standard error RSS <- R2_complement * vl sig <- sqrt(RSS * (1 / (n - 2))) ## statistics for slope beta.se <- sig * rep(1 / sqrt(vr), each = pl) beta.tv <- beta / beta.se beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE) ## F-statistic and p-value F.fv <- (n - 2) * R2 / R2_complement F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE) ## export data.frame(LHS = rep(colnames(dat_LHS), times = pr), RHS = rep(colnames(dat_RHS), each = pl), alpha = c(alpha), beta = c(beta), beta.se = c(beta.se), beta.tv = c(beta.tv), beta.pv = c(beta.pv), sig = c(sig), R2 = c(R2), F.fv = c(F.fv), F.pv = c(F.pv), stringsAsFactors = FALSE) } </code></pre> Apply this to Example 1 in the question. <pre class="prettyprint"><code>general_paired_simpleLM(dat[1:3], dat[4:5]) # LHS RHS alpha beta beta.se beta.tv beta.pv sig #1 A D -0.009212582 0.3450939 0.01171768 29.45071 1.772671e-50 0.09044509 #2 B D 0.012474593 0.2389177 0.01420516 16.81908 1.201421e-30 0.10964516 #3 C D -0.005958236 0.4565443 0.01397619 32.66585 1.749650e-54 0.10787785 #4 A E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866 #5 B E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.10581331 #6 C E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.11849472 # R2 F.fv F.pv #1 0.8984818 867.3441 1.772671e-50 #2 0.7427021 282.8815 1.201421e-30 #3 0.9158840 1067.0579 1.749650e-54 #4 0.8590604 597.3333 1.738263e-43 #5 0.7603718 310.9670 3.636655e-32 #6 0.8985126 867.6375 1.746439e-50 </code></pre> Apply this to Example 2 in the question. <pre class="prettyprint"><code>general_paired_simpleLM(dat[1:4], dat[5]) # LHS RHS alpha beta beta.se beta.tv beta.pv sig #1 A E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.1065687 #2 B E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.1058133 #3 C E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.1184947 #4 D E 0.066190196 -1.3767586 0.03597657 -38.26820 9.828853e-61 0.1952718 # R2 F.fv F.pv #1 0.8590604 597.3333 1.738263e-43 #2 0.7603718 310.9670 3.636655e-32 #3 0.8985126 867.6375 1.746439e-50 #4 0.9372782 1464.4551 9.828853e-61 </code></pre> Apply this to Example 3 in the question. <pre class="prettyprint"><code>general_paired_simpleLM(dat[1], dat[2:5]) # LHS RHS alpha beta beta.se beta.tv beta.pv sig #1 A B 0.112229318 1.0703491 0.07686011 13.92594 5.796437e-25 0.16446951 #2 A C 0.025628210 0.7114422 0.02789832 25.50126 4.731618e-45 0.10272687 #3 A D -0.009212582 0.3450939 0.01171768 29.45071 1.772671e-50 0.09044509 #4 A E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866 # R2 F.fv F.pv #1 0.6643051 193.9317 5.796437e-25 #2 0.8690390 650.3142 4.731618e-45 #3 0.8984818 867.3441 1.772671e-50 #4 0.8590604 597.3333 1.738263e-43 </code></pre> We can even just do a simple linear regression between two variables: <pre class="prettyprint"><code>general_paired_simpleLM(dat[1], dat[2]) # LHS RHS alpha beta beta.se beta.tv beta.pv sig #1 A B 0.1122293 1.070349 0.07686011 13.92594 5.796437e-25 0.1644695 # R2 F.fv F.pv #1 0.6643051 193.9317 5.796437e-25 </code></pre> This means that the <code>simpleLM</code> function in is now obsolete. <hr> Appendix: Markdown (needs MathJax support) fot the picture <pre class="prettyprint"><code>Denote our variables by $x_1$, $x_2$, etc, a pairwise simple linear regression takes the form $$x_i = \alpha_{ij} + \beta_{ij}x_j$$ where $\alpha_{ij}$ and $\beta_{ij}$ is the intercept and the slope of $x_i \sim x_j$, respectively. We also denote $m_i$ and $v_i$ as the sample mean and **unscaled** sample variance of $x_i$. Here, the unscaled variance is just the sum of squares without dividing by sample size, that is $v_i = \sum_{k = 1}^n(x_{ik} - m_i)^2 = (\sum_{k = 1}^nx_{ik}^2) - n m_i^2$. We also denote $V_{ij}$ as the **unscaled** covariance between $x_i$ and $x_j$: $V_{ij} = \sum_{k = 1}^n(x_{ik} - m_i)(x_{jk} - m_j)$ = $(\sum_{k = 1}^nx_{ik}x_{jk}) - nm_im_j$. Using the results for a simple linear regression given in [Function to calculate R2 (R-squared) in R](https://stackoverflow.com/a/40901487/4891738), we have $$\beta_{ij} = V_{ij} \ / \ v_j,\quad \alpha_{ij} = m_i - \beta_{ij}m_j,\quad r_{ij}^2 = V_{ij}^2 \ / \ (v_iv_j),$$ where $r_{ij}^2$ is the R-squared. Knowing $r_{ij}^2 = RSS_{ij} \ / \ TSS_{ij}$ where $RSS_{ij}$ and $TSS_{ij} = v_i$ are residual sum of squares and total sum of squares of $x_i \sim x_j$, we can derive $RSS_{ij}$ and residual standard error $\sigma_{ij}$ **without actually computing residuals**: $$RSS_{ij} = (1 - r_{ij}^2)v_i,\quad \sigma_{ij} = \sqrt{RSS_{ij} \ / \ (n - 2)}.$$ F-statistic $F_{ij}$ and associated p-value $p_{ij}^F$ can also be obtained from sum of squares: $$F_{ij} = \tfrac{(TSS_{ij} - RSS_{ij}) \ / \ 1}{RSS_{ij} \ / \ (n - 2)} = (n - 2) r_{ij}^2 \ / \ (1 - r_{ij}^2),\quad p_{ij}^F = 1 - \texttt{CDF_F}(F_{ij};\ 1,\ n - 2),$$ where $\texttt{CDF_F}$ denotes the CDF of F-distribution. The only thing left is the standard error $e_{ij}$, t-statistic $t_{ij}$ and associated p-value $p_{ij}^t$ for $\beta_{ij}$, which are $$e_{ij} = \sigma_{ij} \ / \ \sqrt{v_i},\quad t_{ij} = \beta_{ij} \ / \ e_{ij},\quad p_{ij}^t = 2 * \texttt{CDF_t}(-|t_{ij}|; \ n - 2),$$ where $\texttt{CDF_t}$ denotes the CDF of t-distribution. </code></pre>

Fast pairwise simple linear regression between variables in a data frame

Tags:

performance

r

linear-regression

regression

lm

I have seen pairwise or general paired simple linear regression many times on Stack Overflow. Here is a toy dataset for this kind of problem.

set.seed(0)
X <- matrix(runif(100), 100, 5, dimnames = list(1:100, LETTERS[1:5]))
b <- c(1, 0.7, 1.3, 2.9, -2)
dat <- X * b[col(X)] + matrix(rnorm(100 * 5, 0, 0.1), 100, 5)
dat <- as.data.frame(dat)
pairs(dat)

So basically we want to compute 5 * 4 = 20 regression lines:

-----  A ~ B  A ~ C  A ~ D  A ~ E
B ~ A  -----  B ~ C  B ~ D  B ~ E
C ~ A  C ~ B  -----  C ~ D  C ~ E
D ~ A  D ~ B  D ~ C  -----  D ~ E
E ~ A  E ~ B  E ~ C  E ~ D  -----

Here is a poor man's strategy:

poor <- function (dat) {
  n <- nrow(dat)
  p <- ncol(dat)
  ## all formulae
  LHS <- rep(colnames(dat), p)
  RHS <- rep(colnames(dat), each = p)
  ## function to fit and summarize a single model
  fitmodel <- function (LHS, RHS) {
    if (RHS == LHS) {
      z <- data.frame("LHS" = LHS, "RHS" = RHS,
                      "alpha" = 0,
                      "beta" = 1,
                      "beta.se" = 0,
                      "beta.tv" = Inf,
                      "beta.pv" = 0,
                      "sig" = 0,
                      "R2" = 1,
                      "F.fv" = Inf,
                      "F.pv" = 0,
                      stringsAsFactors = FALSE)
      } else {
      result <- summary(lm(reformulate(RHS, LHS), data = dat))
      z <- data.frame("LHS" = LHS, "RHS" = RHS,
                      "alpha" = result$coefficients[1, 1],
                      "beta" = result$coefficients[2, 1],
                      "beta.se" = result$coefficients[2, 2],
                      "beta.tv" = result$coefficients[2, 3],
                      "beta.pv" = result$coefficients[2, 4],
                      "sig" = result$sigma,
                      "R2" = result$r.squared,
                      "F.fv" = result$fstatistic[[1]],
                      "F.pv" = pf(result$fstatistic[[1]], 1, n - 2, lower.tail = FALSE),
                      stringsAsFactors = FALSE)
        }
      z
      }
  ## loop through all models
  do.call("rbind.data.frame", c(Map(fitmodel, LHS, RHS),
                                list(make.row.names = FALSE,
                                     stringsAsFactors = FALSE)))
  }

The logic is clear: get all pairs, construct the model formula (reformulate), fit a regression (lm), do a summary summary, return all statistics and rbind them to be a data frame.

OK, fine, but what if there are p variables? We then need to do p * (p - 1) regressions!

An immediate improvement I could think of, is Fitting a linear model with multiple LHS. For example, the first column of that formula matrix is merged to

cbind(B, C, D, E) ~ A

This reduces the number of regression from p * (p - 1) to p.

But we can definitely do even better without using lm and summary. Here is my previous attempt: Is there a fast estimation of simple regression (a regression line with only intercept and slope)?. It is fast because it uses covariance between variables for estimation, like solving the normal equation. But the simpleLM function there is pretty limited:

it needs to compute residual vectors to estimate residual standard error, which is a performance bottleneck;
it doesn't support multiple LHS, so it needs be called p * (p - 1) times in pairwise regression settings).

Can we generalize it for fast pairwise regression, by writing a function pairwise_simpleLM?

General paired simple linear regression

A more useful variation of the above pairwise regression is the general paired regression between a set of LHS variables and a set of RHS variables.

Example 1

Fit paired regression between LHS variables A, B, C and RHS variables D, E, that is, fit 6 simple linear regression lines:

A ~ D  A ~ E
B ~ D  B ~ E
C ~ D  C ~ E

Example 2

Fit a simple linear regression with multiple LHS variables to a particular RHS variable, say: cbind(A, B, C, D) ~ E.

Example 3

Fit a simple linear regression with a particular LHS variable, and a set of RHS variables one at a time, for example:

A ~ B  A ~ C  A ~ D  A ~ E

Can we also have a fast function general_paired_simpleLM for this?

Caution

All variables must be numeric; factors are not allowed or pairwise regression makes no sense.
Weighted regression is not discussed, as variance-covariance method is not justified in that case.

810

asked Aug 21 '18 17:08

Zheyuan Li

1 Answers

Some statistical result / background

(Link in the picture: Function to calculate R2 (R-squared) in R)

Computational details

Computations involved here is basically the computation of the variance-covariance matrix. Once we have it, results for all pairwise regression is just element-wise matrix arithmetic.

The variance-covariance matrix can be obtained by R function cov, but functions below compute it manually using crossprod. The advantage is that it can obviously benefit from an optimized BLAS library if you have it. Be aware that significant amount of simplification is made in this way. R function cov has argument use which allows handling NA, but crossprod does not. I am assuming that your dat has no missing values at all! If you do have missing values, remove them yourself with na.omit(dat).

The initial as.matrix that converts a data frame to a matrix might be an overhead. In principle if I code everything up in C / C++, I can eliminate this coercion. And in fact, many element-wise matrix matrix arithmetic can be merged into a single loop-nest. However, I really bother doing this at the moment (as I have no time).

Some people may argue that the format of the final return is inconvenient. There could be other format:

a list of data frames, each giving the result of the regression for a particular LHS variable;
a list of data frames, each giving the result of the regression for a particular RHS variable.

This is really opinion-based. Anyway, you can always do a split.data.frame by "LHS" column or "RHS" column yourself on the data frame I return you.

R function `pairwise_simpleLM`

pairwise_simpleLM <- function (dat) {
  ## matrix and its dimension (n: numbeta.ser of data; p: numbeta.ser of variables)
  dat <- as.matrix(dat)
  n <- nrow(dat)
  p <- ncol(dat)
  ## variable summary: mean, (unscaled) covariance and (unscaled) variance
  m <- colMeans(dat)
  V <- crossprod(dat) - tcrossprod(m * sqrt(n))
  d <- diag(V)
  ## R-squared (explained variance) and its complement
  R2 <- (V ^ 2) * tcrossprod(1 / d)
  R2_complement <- 1 - R2
  R2_complement[seq.int(from = 1, by = p + 1, length = p)] <- 0
  ## slope and intercept
  beta <- V * rep(1 / d, each = p)
  alpha <- m - beta * rep(m, each = p)
  ## residual sum of squares and standard error
  RSS <- R2_complement * d
  sig <- sqrt(RSS * (1 / (n - 2)))
  ## statistics for slope
  beta.se <- sig * rep(1 / sqrt(d), each = p)
  beta.tv <- beta / beta.se
  beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
  ## F-statistic and p-value
  F.fv <- (n - 2) * R2 / R2_complement
  F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
  ## export
  data.frame(LHS = rep(colnames(dat), times = p),
             RHS = rep(colnames(dat), each = p),
             alpha = c(alpha),
             beta = c(beta),
             beta.se = c(beta.se),
             beta.tv = c(beta.tv),
             beta.pv = c(beta.pv),
             sig = c(sig),
             R2 = c(R2),
             F.fv = c(F.fv),
             F.pv = c(F.pv),
             stringsAsFactors = FALSE)
  }

Let's compare the result on the toy dataset in the question.

oo <- poor(dat)
rr <- pairwise_simpleLM(dat)
all.equal(oo, rr)
#[1] TRUE

Let's see its output:

rr[1:3, ]
#  LHS RHS      alpha      beta    beta.se  beta.tv      beta.pv       sig
#1   A   A 0.00000000 1.0000000 0.00000000      Inf 0.000000e+00 0.0000000
#2   B   A 0.05550367 0.6206434 0.04456744 13.92594 5.796437e-25 0.1252402
#3   C   A 0.05809455 1.2215173 0.04790027 25.50126 4.731618e-45 0.1346059
#         R2     F.fv         F.pv
#1 1.0000000      Inf 0.000000e+00
#2 0.6643051 193.9317 5.796437e-25
#3 0.8690390 650.3142 4.731618e-45

When we have the same LHS and RHS, regression is meaningless hence intercept is 0, slope is 1, etc.

What about speed? Still using this toy example:

library(microbenchmark)
microbenchmark("poor_man's" = poor(dat), "fast" = pairwise_simpleLM(dat))
#Unit: milliseconds
#       expr        min         lq       mean     median         uq       max
# poor_man's 127.270928 129.060515 137.813875 133.390722 139.029912 216.24995
#       fast   2.732184   3.025217   3.381613   3.134832   3.313079  10.48108

The gap is going be increasingly wider as we have more variables. For example, with 10 variables we have:

set.seed(0)
X <- matrix(runif(100), 100, 10, dimnames = list(1:100, LETTERS[1:10]))
b <- runif(10)
DAT <- X * b[col(X)] + matrix(rnorm(100 * 10, 0, 0.1), 100, 10)
DAT <- as.data.frame(DAT)
microbenchmark("poor_man's" = poor(DAT), "fast" = pairwise_simpleLM(DAT))
#Unit: milliseconds
#       expr        min         lq       mean     median        uq        max
# poor_man's 548.949161 551.746631 573.009665 556.307448 564.28355 801.645501
#       fast   3.365772   3.578448   3.721131   3.621229   3.77749   6.791786

R function `general_paired_simpleLM`

general_paired_simpleLM <- function (dat_LHS, dat_RHS) {
  ## matrix and its dimension (n: numbeta.ser of data; p: numbeta.ser of variables)
  dat_LHS <- as.matrix(dat_LHS)
  dat_RHS <- as.matrix(dat_RHS)
  if (nrow(dat_LHS) != nrow(dat_RHS)) stop("'dat_LHS' and 'dat_RHS' don't have same number of rows!")
  n <- nrow(dat_LHS)
  pl <- ncol(dat_LHS)
  pr <- ncol(dat_RHS)
  ## variable summary: mean, (unscaled) covariance and (unscaled) variance
  ml <- colMeans(dat_LHS)
  mr <- colMeans(dat_RHS)
  vl <- colSums(dat_LHS ^ 2) - ml * ml * n
  vr <- colSums(dat_RHS ^ 2) - mr * mr * n
  ##V <- crossprod(dat - rep(m, each = n))  ## cov(u, v) = E[(u - E[u])(v - E[v])]
  V <- crossprod(dat_LHS, dat_RHS) - tcrossprod(ml * sqrt(n), mr * sqrt(n))  ## cov(u, v) = E[uv] - E{u]E[v]
  ## R-squared (explained variance) and its complement
  R2 <- (V ^ 2) * tcrossprod(1 / vl, 1 / vr)
  R2_complement <- 1 - R2
  ## slope and intercept
  beta <- V * rep(1 / vr, each = pl)
  alpha <- ml - beta * rep(mr, each = pl)
  ## residual sum of squares and standard error
  RSS <- R2_complement * vl
  sig <- sqrt(RSS * (1 / (n - 2)))
  ## statistics for slope
  beta.se <- sig * rep(1 / sqrt(vr), each = pl)
  beta.tv <- beta / beta.se
  beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
  ## F-statistic and p-value
  F.fv <- (n - 2) * R2 / R2_complement
  F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
  ## export
  data.frame(LHS = rep(colnames(dat_LHS), times = pr),
             RHS = rep(colnames(dat_RHS), each = pl),
             alpha = c(alpha),
             beta = c(beta),
             beta.se = c(beta.se),
             beta.tv = c(beta.tv),
             beta.pv = c(beta.pv),
             sig = c(sig),
             R2 = c(R2),
             F.fv = c(F.fv),
             F.pv = c(F.pv),
             stringsAsFactors = FALSE)
  }

Apply this to Example 1 in the question.

general_paired_simpleLM(dat[1:3], dat[4:5])
#  LHS RHS        alpha       beta    beta.se   beta.tv      beta.pv        sig
#1   A   D -0.009212582  0.3450939 0.01171768  29.45071 1.772671e-50 0.09044509
#2   B   D  0.012474593  0.2389177 0.01420516  16.81908 1.201421e-30 0.10964516
#3   C   D -0.005958236  0.4565443 0.01397619  32.66585 1.749650e-54 0.10787785
#4   A   E  0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#5   B   E  0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.10581331
#6   C   E  0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.11849472
#         R2      F.fv         F.pv
#1 0.8984818  867.3441 1.772671e-50
#2 0.7427021  282.8815 1.201421e-30
#3 0.9158840 1067.0579 1.749650e-54
#4 0.8590604  597.3333 1.738263e-43
#5 0.7603718  310.9670 3.636655e-32
#6 0.8985126  867.6375 1.746439e-50

Apply this to Example 2 in the question.

general_paired_simpleLM(dat[1:4], dat[5])
#  LHS RHS       alpha       beta    beta.se   beta.tv      beta.pv       sig
#1   A   E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.1065687
#2   B   E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.1058133
#3   C   E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.1184947
#4   D   E 0.066190196 -1.3767586 0.03597657 -38.26820 9.828853e-61 0.1952718
#         R2      F.fv         F.pv
#1 0.8590604  597.3333 1.738263e-43
#2 0.7603718  310.9670 3.636655e-32
#3 0.8985126  867.6375 1.746439e-50
#4 0.9372782 1464.4551 9.828853e-61

Apply this to Example 3 in the question.

general_paired_simpleLM(dat[1], dat[2:5])
#  LHS RHS        alpha       beta    beta.se   beta.tv      beta.pv        sig
#1   A   B  0.112229318  1.0703491 0.07686011  13.92594 5.796437e-25 0.16446951
#2   A   C  0.025628210  0.7114422 0.02789832  25.50126 4.731618e-45 0.10272687
#3   A   D -0.009212582  0.3450939 0.01171768  29.45071 1.772671e-50 0.09044509
#4   A   E  0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#         R2     F.fv         F.pv
#1 0.6643051 193.9317 5.796437e-25
#2 0.8690390 650.3142 4.731618e-45
#3 0.8984818 867.3441 1.772671e-50
#4 0.8590604 597.3333 1.738263e-43

We can even just do a simple linear regression between two variables:

general_paired_simpleLM(dat[1], dat[2])
#  LHS RHS     alpha     beta    beta.se  beta.tv      beta.pv       sig
#1   A   B 0.1122293 1.070349 0.07686011 13.92594 5.796437e-25 0.1644695
#         R2     F.fv         F.pv
#1 0.6643051 193.9317 5.796437e-25

This means that the simpleLM function in is now obsolete.

Appendix: Markdown (needs MathJax support) fot the picture

Denote our variables by $x_1$, $x_2$, etc, a pairwise simple linear regression takes the form $$x_i = \alpha_{ij} + \beta_{ij}x_j$$ where $\alpha_{ij}$ and $\beta_{ij}$ is the intercept and the slope of $x_i \sim x_j$, respectively. We also denote $m_i$ and $v_i$ as the sample mean and **unscaled** sample variance of $x_i$. Here, the unscaled variance is just the sum of squares without dividing by sample size, that is $v_i = \sum_{k = 1}^n(x_{ik} - m_i)^2 = (\sum_{k = 1}^nx_{ik}^2) - n m_i^2$. We also denote $V_{ij}$ as the **unscaled** covariance between $x_i$ and $x_j$: $V_{ij} = \sum_{k = 1}^n(x_{ik} - m_i)(x_{jk} - m_j)$ = $(\sum_{k = 1}^nx_{ik}x_{jk}) - nm_im_j$.

Using the results for a simple linear regression given in [Function to calculate R2 (R-squared) in R](https://stackoverflow.com/a/40901487/4891738), we have $$\beta_{ij} = V_{ij} \ / \ v_j,\quad \alpha_{ij} = m_i - \beta_{ij}m_j,\quad r_{ij}^2 = V_{ij}^2 \ / \ (v_iv_j),$$ where $r_{ij}^2$ is the R-squared. Knowing $r_{ij}^2 = RSS_{ij} \ / \ TSS_{ij}$ where $RSS_{ij}$ and $TSS_{ij} = v_i$ are residual sum of squares and total sum of squares of $x_i \sim x_j$, we can derive $RSS_{ij}$ and residual standard error $\sigma_{ij}$ **without actually computing residuals**: $$RSS_{ij} = (1 - r_{ij}^2)v_i,\quad \sigma_{ij} = \sqrt{RSS_{ij} \ / \ (n - 2)}.$$

F-statistic $F_{ij}$ and associated p-value $p_{ij}^F$ can also be obtained from sum of squares: $$F_{ij} = \tfrac{(TSS_{ij} - RSS_{ij}) \ / \ 1}{RSS_{ij} \ / \ (n - 2)} = (n - 2) r_{ij}^2 \ / \ (1 - r_{ij}^2),\quad p_{ij}^F = 1 - \texttt{CDF_F}(F_{ij};\ 1,\ n - 2),$$ where $\texttt{CDF_F}$ denotes the CDF of F-distribution.

The only thing left is the standard error $e_{ij}$, t-statistic $t_{ij}$ and associated p-value $p_{ij}^t$ for $\beta_{ij}$, which are $$e_{ij} = \sigma_{ij} \ / \ \sqrt{v_i},\quad t_{ij} = \beta_{ij} \ / \ e_{ij},\quad p_{ij}^t = 2 * \texttt{CDF_t}(-|t_{ij}|; \ n - 2),$$ where $\texttt{CDF_t}$ denotes the CDF of t-distribution.

answered Oct 20 '22 13:10

Zheyuan Li

Related questions
                            
                                Changing select input choices in r Shiny
                            
                                Exceeded maximum number of DLLs in R
                            
                                R: a cat of many colors
                            
                                Topological data analysis - where to begin
                            
                                Why do some primitives have byte-codes and some do not?
                            
                                Difference between ls() and objects()
                            
                                Why does R store the loop variable/index/dummy in memory?
                            
                                Text Justification in Rmarkdown word document
                            
                                How to solve gaps and island problems in R and performance vs SQL?
                            
                                store arrangeGrob to object, does not create printable object
                            
                                force a regular plot object into a Grob for use in grid.arrange
                            
                                Count combinations of length 2 per id
                            
                                Parallel processing with xgboost and caret
                            
                                geom_blank drops NA
                            
                                Is there a _merge indicator available after a merge?
                            
                                Redrawing venn diagram
                            
                                tidyverse interfering with ggplot2? cannot access map_data
                            
                                How to plot x-axis labels and bars between tick marks in ggplot2 bar plot?
                            
                                Process list from R to C and access it
                            
                                Arithmetic in R faster on numerics as opposed to integers. What's going on?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fast pairwise simple linear regression between variables in a data frame

Tags:

performance

r

linear-regression

regression

lm

General paired simple linear regression

Zheyuan Li

People also ask

1 Answers

Some statistical result / background

Computational details

R function `pairwise_simpleLM`

R function `general_paired_simpleLM`

Zheyuan Li

Recent Activity

Donate For Us

Fast pairwise simple linear regression between variables in a data frame

Tags:

performance

r

linear-regression

regression

lm

General paired simple linear regression

Zheyuan Li

People also ask

1 Answers

Some statistical result / background

Computational details

R function pairwise_simpleLM

R function general_paired_simpleLM

Zheyuan Li

Related questions

Recent Activity

Donate For Us

R function `pairwise_simpleLM`

R function `general_paired_simpleLM`