I have a matrix of factors in R and want to convert it to a matrix of 0-1 dummy variables for all possible levels of each factor.
However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it as a sparse matrix, otherwise it does not fit in my 12 GB of RAM.
Currently, I am using the following code, which works fine and takes only seconds:
library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)
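To make the encoding concrete, here is a minimal sketch on a made-up data frame. The names `toy`, `f1`, `f2` and `X_toy` are assumptions for illustration only and are not part of the real data.

```r
library(Matrix)

# Hypothetical toy data: two factor columns, four observations
toy <- data.frame(f1 = factor(c("a", "b", "c", "a")),
                  f2 = factor(c("x", "y", "x", "y")))

# Same formula as above: all columns, no intercept
X_toy <- sparse.model.matrix(~ . - 1, data = toy)

dim(X_toy)       # 4 4
colnames(X_toy)  # one column per kept factor level
```

Note that with more than one factor, the default treatment contrasts keep every level only for the first factor; later factors lose their first level, which is why this toy example has 4 columns rather than 5.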
However, I want to use the e1071 package in R and eventually save this matrix in libsvm format with write.matrix.csr(), so I first need to convert my sparse matrix to the SparseM format.
I tried to do:
library(SparseM)
X2 <- as.matrix.csr(X)
but it very quickly fills my RAM and eventually R crashes. I suspect that internally, as.matrix.csr first converts the sparse matrix to a dense matrix that does not fit in my computer's memory.
My other alternative would be to create my sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors), but it does not accept a data frame of factors.
Is there an equivalent to sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.
As a general criterion, the number of non-zero elements in a sparse matrix is expected to be roughly equal to the number of rows or the number of columns. To create a sparse matrix in R, we can use the sparseMatrix function from the Matrix package.
One way to save a sparse matrix is as an .mtx file, which stores the matrix in MatrixMarket format. We can use the writeMM function to save the sparse matrix object to a file. In this example, we save our toy sparse matrix to a file named "sparse_matrix.mtx".
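A minimal sketch of that round trip, assuming a small made-up matrix `M` and a temporary directory as the target location (both are illustrative choices):

```r
library(Matrix)

# Hypothetical toy matrix: 3 x 3 with three non-zero entries
M <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30))

# Write to MatrixMarket format in a temporary directory
path <- file.path(tempdir(), "sparse_matrix.mtx")
writeMM(M, file = path)

# Read it back; readMM returns a sparse matrix as well
M2 <- readMM(path)

# Compare the two as dense matrices (only sensible for small examples)
all.equal(as.matrix(M), as.matrix(M2))
```

The dense comparison is only for checking a toy example; on a matrix of realistic size it would defeat the purpose of the sparse format.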
The dgCMatrix class is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the “standard” class for sparse numeric matrices in the Matrix package.
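The slots of a dgCMatrix can be inspected directly, which helps when converting to another package's format. The sketch below uses a made-up 3 x 3 matrix `M`; note that the index slots are zero-based.

```r
library(Matrix)

# Hypothetical toy matrix: non-zeros at (1,1), (3,1) and (2,3)
M <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30))

class(M)  # "dgCMatrix"
M@i       # 0 2 1   : zero-based row indices, stored column by column
M@p       # 0 2 2 3 : zero-based column pointers, length ncol(M) + 1
M@x       # 10 20 30: the non-zero values, in the same order as M@i
M@Dim     # 3 3
```

Column 2 is empty, which shows up as two equal consecutive entries in `M@p`.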
Quite tricky but I think I got it.
Let's start with a sparse matrix from the Matrix package:
library(Matrix)
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)
The Matrix package uses a column-oriented compression format, while SparseM supports both column- and row-oriented formats and has functions that can easily handle the conversion from one format to the other.
So we will first convert our column-oriented Matrix object into a column-oriented SparseM matrix: we just need to be careful to call the right constructor and to note that the two packages use different conventions for indices (Matrix starts at 0, SparseM at 1):
library(SparseM)
# shift the zero-based Matrix indices to SparseM's one-based convention
X.csc <- new("matrix.csc", ra = X@x,
                           ja = X@i + 1L,
                           ia = X@p + 1L,
                           dimension = X@Dim)
Then, change from column-oriented to row-oriented format:
X.csr <- as.matrix.csr(X.csc)
And you're done! You can check that the two matrices are identical (on my small example) by doing:
range(as.matrix(X) - as.matrix(X.csc))
# [1] 0 0
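Putting the steps together for the original use case, here is a sketch under a few assumptions: the helper name `dgc_to_csr`, the toy data frame `toy`, and the output file name are all made up for illustration, and the e1071 step only runs if that package is installed.

```r
library(Matrix)
library(SparseM)

# Hypothetical helper: Matrix dgCMatrix -> SparseM matrix.csr,
# going through matrix.csc so that no dense matrix is ever built
dgc_to_csr <- function(m) {
  stopifnot(is(m, "dgCMatrix"))
  m.csc <- new("matrix.csc",
               ra = m@x,
               ja = m@i + 1L,   # zero-based -> one-based row indices
               ia = m@p + 1L,   # zero-based -> one-based column pointers
               dimension = m@Dim)
  as.matrix.csr(m.csc)
}

# Toy stand-in for the real data frame of factors
toy <- data.frame(f1 = factor(c("a", "b", "c", "a")),
                  f2 = factor(c("x", "y", "x", "y")))
X_toy <- sparse.model.matrix(~ . - 1, data = toy)
X_toy.csr <- dgc_to_csr(X_toy)

# Sanity check on the small example (dense comparison, toy sizes only)
range(as.matrix(X_toy) - as.matrix(X_toy.csr))

# Optional: write in libsvm format if e1071 is available
if (requireNamespace("e1071", quietly = TRUE)) {
  out <- file.path(tempdir(), "X_toy.libsvm")
  e1071::write.matrix.csr(X_toy.csr, file = out)
}
```

The dense check should be skipped on the full 91690x16593 matrix, since `as.matrix` would bring back the memory problem from the question.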