My goal is to "sum" two incompatible matrices (matrices with different dimensions), using (and preserving) their row and column names.
I've come up with this approach: convert the matrices to data.table objects, join them, and then sum the column vectors.
An example:
> M1
  1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
> M2
  1 3 4 5 8
1 0 0 1 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
8 0 0 0 0 0
> M1 %ms% M2
  1 3 4 5 7 8
1 0 0 2 0 0 0
3 0 0 0 0 0 0
4 2 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
This is my code:
M1 <- matrix(c(0,0,1,0,0,0, 0,0,0,0,0,0, 1,0,0,0,0,0, 0,0,0,0,0,0, 0,0,0,0,1,0, 0,0,0,0,0,0), byrow = TRUE, ncol = 6)
rownames(M1) <- colnames(M1) <- c(1,3,4,5,7,8)
M2 <- matrix(c(0,0,1,0,0, 0,0,0,0,0, 1,0,0,0,0, 0,0,0,0,0, 0,0,0,0,0), byrow = TRUE, ncol = 5)
rownames(M2) <- colnames(M2) <- c(1,3,4,5,8)
# convert to data.table objects (the numeric column names get an "X" prefix, as in the output below)
library(data.table)
DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn")
DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn")
# join and sum of common columns
if (nrow(DT1) > nrow(DT2)) {
  A <- DT2[DT1, roll = TRUE]
  A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1,
           X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
}
That outputs:
rn X1 X3 X4 X5 X7 X8
1: 1 0 0 2 0 0 0
2: 3 0 0 0 0 0 0
3: 4 2 0 0 0 0 0
4: 5 0 0 0 0 0 0
5: 7 0 0 0 0 1 0
6: 8 0 0 0 0 0 0
Then I can convert this data.table back to a matrix and fix the row and column names.
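For completeness, the back-conversion I have in mind looks roughly like this (just a sketch; S is a name used here only for illustration, and it assumes the X prefix that data.table added to the column names):
S <- A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1,
              X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
# drop the key column, rebuild the matrix and restore the original names
M <- as.matrix(S[, setdiff(names(S), "rn"), with = FALSE])
rownames(M) <- S$rn
colnames(M) <- sub("^X", "", colnames(M))
M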
The questions are:
1. How can I generalize this procedure? I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), because I want to apply this function to matrices whose dimensions (and row/column names) are not known in advance. In summary, I need a merge procedure that behaves as described (a sketch of what I have in mind is just below this list).
2. Are there other strategies/implementations that achieve the same goal and are, at the same time, faster and more general? (I'm hoping some data.table monster can help me.)
3. What kind of join (inner, outer, etc.) is this procedure comparable to?
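To illustrate what I mean by "automatically create" that expression, here is a rough, untested sketch of the kind of approach I imagine (it assumes the X prefix and the .1 suffixes that the join above produces):
# rough sketch (not verified on data.table 1.8.2): build the j expression
# as text from the shared / unshared column names, then eval/parse it
common <- setdiff(intersect(colnames(DT1), colnames(DT2)), "rn")  # columns in both tables
only1  <- setdiff(colnames(DT1), c(common, "rn"))                 # columns only in DT1
only2  <- setdiff(colnames(DT2), c(common, "rn"))                 # columns only in DT2
jexpr  <- paste0("list(",
                 paste(c(sprintf("%s = %s + %s.1", common, common, common),
                         only1, only2),
                       collapse = ", "),
                 ")")
A <- DT2[DT1, roll = TRUE]
A[, eval(parse(text = jexpr)), by = rn]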
Thanks in advance.
p.s.: I'm using data.table version 1.8.2
EDIT - SOLUTIONS
@Aaron's solution. No external libraries, only base R. It also works on any number of matrices.
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows, cols))
  for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m
  out
}
@MadScone's solution. Uses the reshape2 package. It works on only two matrices per call.
library(reshape2)

add_matrices_2 <- function(m1, m2) {
  # melt both matrices to long form, stack them and re-cast, summing duplicated cells
  m <- acast(rbind(melt(m1), melt(m2)), Var1 ~ Var2, fun.aggregate = sum)
  mn <- union(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  m
}
@Aaron's solution. Uses the Matrix package. It works only on sparse (symmetric) matrices, and also on any number of them.
library(Matrix)

add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(
      i = ifelse(ilj, i, j),
      j = ifelse(ilj, j, i),
      x = s$x,
      dims = c(nrows, ncols),
      dimnames = list(rows, cols),
      symmetric = TRUE
    )
  })
  Reduce(`+`, newms)
}
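A hedged usage sketch for the sparse version: the matrices have to be symmetric and converted to Matrix objects first, e.g. with Matrix(), as in @Aaron's timings further below.
library(Matrix)
mm1 <- Matrix(M1)  # coerce the dense symmetric matrices to sparse symmetric form
mm2 <- Matrix(M2)
add_matrices_3(mm1, mm2)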
BENCHMARK (100 runs with the microbenchmark package)
Unit: microseconds
expr min lq median uq max
1 add_matrices_1 196.009 257.5865 282.027 291.2735 549.397
2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448
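For reference, the benchmark above was produced with a call along these lines (the exact setup is not shown here, so this is only a sketch):
library(microbenchmark)
microbenchmark(add_matrices_1 = add_matrices_1(M1, M2),
               add_matrices_2 = add_matrices_2(M1, M2),
               times = 100)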
The benchmark speaks for itself: @Aaron's solution wins.
Details: for insights about performance (which depends on the size and sparsity of the matrices), see @Aaron's edit below (and the solution for sparse matrices, add_matrices_3).
ANSWER (@Aaron)
I'd just line up the names and go to town with base R.
Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.
add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows, cols))
  for (M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
  out
}
It then works like this:
# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)
add_matrices_1(M1, M2)
# 1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0
For bigger matrices, however, it doesn't do as well. Here's a function to make a matrix, choosing n columns out of N possibilities, and filling k spots with non-zero values. (This assumes symmetrical matrices.)
makeM <- function(N, n, k) {
  s1 <- sample(N, n)
  M1 <- array(0, dim = c(n, n), dimnames = list(s1, s1))
  r1 <- sample(n, k, replace = TRUE)
  c1 <- sample(n, k, replace = TRUE)
  M1[cbind(c(r1, c1), c(c1, r1))] <- sample(N, k)
  M1
}
Then here's another version that uses sparse matrices.
add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(i = ifelse(ilj, i, j),
                 j = ifelse(ilj, j, i),
                 x = s$x,
                 dims = c(nrows, ncols),
                 dimnames = list(rows, cols),
                 symmetric = TRUE)
  })
  Reduce(`+`, newms)
}
This version is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)
set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
# user system elapsed
# 2.987 0.841 4.133
system.time(add_matrices_3(mm1, mm2))
# user system elapsed
# 0.042 0.012 0.504
But when the matrices are small, my first solution is still faster.
set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
# expr min lq median uq max
# 1 add_matrices_1(M1, M2) 398.495 406.543 423.825 544.0905 43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24
Moral of the story: Size and sparsity matter.
Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and not worry about speed unless you run into trouble. In small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt. When that gets slow, I'd write a function like my second attempt.