I have a very large (~500,000 x ~500,000) sparse matrix in R, and I am trying to divide each column by its sum:
sm = t(t(sm) / colSums(sm))
However, when I do this I get the following error:
# Error in evaluating the argument 'x' in selecting a method for function 't':
# Error: cannot allocate vector of size 721.1 Gb
Is there a better way to do this in R? I am able to store colSums
fine, as well as compute and store the transpose of the sparse matrix, but the problem seems to arrive when trying to perform "/"
. Looks like sparse matrix is converted to full dense matrix here.
Any help would be greatly appreciated. Thank you!
This is what we can do, assuming A
is a dgCMatrix
:
A@x <- A@x / rep.int(colSums(A), diff(A@p))
This requires some understanding of dgCMatrix
class.
@x
stores none-zero matrix values, in a packed 1D array;@p
stores the cumulative number of non-zero elements by column, hence diff(A@p)
gives the number of non-zero elements for each column.We repeat each element of colSums(A)
by number of none-zero elements in that column, then divide A@x
by this vector. In this end, we update A@x
by rescaled values. In this way, column rescaling is done in a sparse manner.
Example:
library(Matrix)
set.seed(2); A <- Matrix(rbinom(100,10,0.05), nrow = 10)
#10 x 10 sparse Matrix of class "dgCMatrix"
# [1,] . . 1 . 2 . 1 . . 2
# [2,] 1 . . . . . 1 . 1 .
# [3,] . 1 1 1 . 1 1 . . .
# [4,] . . . 1 . 2 . . . .
# [5,] 2 . . . 2 . 1 . . .
# [6,] 2 1 . 1 1 1 . 1 1 .
# [7,] . 2 . 1 2 1 . . 2 .
# [8,] 1 . . . . 3 . 1 . .
# [9,] . . 2 1 . 1 . . 1 .
#[10,] . . . . 1 1 . . . .
diff(A@p) ## number of non-zeros per column
# [1] 4 3 3 5 5 7 4 2 4 1
colSums(A) ## column sums
# [1] 6 4 4 5 8 10 4 2 5 2
A@x <- A@x / rep.int(colSums(A), diff(A@p)) ## sparse column rescaling
#10 x 10 sparse Matrix of class "dgCMatrix"
# [1,] . . 0.25 . 0.250 . 0.25 . . 1
# [2,] 0.1666667 . . . . . 0.25 . 0.2 .
# [3,] . 0.25 0.25 0.2 . 0.1 0.25 . . .
# [4,] . . . 0.2 . 0.2 . . . .
# [5,] 0.3333333 . . . 0.250 . 0.25 . . .
# [6,] 0.3333333 0.25 . 0.2 0.125 0.1 . 0.5 0.2 .
# [7,] . 0.50 . 0.2 0.250 0.1 . . 0.4 .
# [8,] 0.1666667 . . . . 0.3 . 0.5 . .
# [9,] . . 0.50 0.2 . 0.1 . . 0.2 .
#[10,] . . . . 0.125 0.1 . . . .
@thelatemail mentioned another method, by first converting dgCMatrix
to dgTMatrix
:
AA <- as(A, "dgTMatrix")
A@x <- A@x / colSumns(A)[AA@j + 1L]
For dgTMatrix
class there is no @p
but @j
, giving the column index (0 based) for none zero matrix elements.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With