R: sparse matrix conversion

I have a matrix of factors in R and want to convert it to a matrix of 0/1 dummy variables, with one column for every level of each factor.

However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it as a sparse matrix, otherwise it does not fit in my 12 GB of RAM.

Currently, I am using the following code, which works fine and takes only seconds:

library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)
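
To make the encoding concrete, here is a minimal sketch on a made-up three-row data frame (`toy` and its columns are hypothetical, not my real data). It also shows a detail of the formula: with `~ . - 1`, only the first factor gets a column for every level, while later factors lose their first level to the default treatment contrasts.

```r
library(Matrix)

# Hypothetical toy data: two factor columns, three rows
toy <- data.frame(a = factor(c("x", "y", "x")),
                  b = factor(c("u", "v", "w")))

# Same call as above: dummy-encode the factors without an intercept
X_toy <- sparse.model.matrix(~ . - 1, data = toy)

dim(X_toy)       # 3 rows, 4 columns
colnames(X_toy)  # "ax" "ay" "bv" "bw" (no column for level "u" of b)
```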

However, I want to use the e1071 package in R, and eventually save this matrix to libsvm format with write.matrix.csr(), so first I need to convert my sparse matrix to the SparseM format.

I tried to do:

library(SparseM)  
X2 <- as.matrix.csr(X)

but it very quickly fills my RAM and eventually R crashes. I suspect that internally, as.matrix.csr first converts the sparse matrix to a dense matrix that does not fit in my computer memory.
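
A rough size estimate is consistent with that suspicion, assuming the dense copy would hold 8-byte doubles (an assumption about the internals, not something I verified):

```r
# Dimensions of the dummy matrix quoted above
n_rows <- 91690
n_cols <- 16593

# Dense storage cost at 8 bytes per double-precision entry
dense_bytes <- n_rows * n_cols * 8
dense_bytes / 1e9    # about 12.2 (GB)
dense_bytes / 2^30   # about 11.3 (GiB)
```

That is already at the limit of 12 GB of RAM before counting any temporary copies.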

My other alternative would be to create my sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors), but it does not accept a data frame of factors.

Is there an equivalent to sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.

Benoit_Plante asked Jun 28 '13 23:06


1 Answer

Quite tricky but I think I got it.

Let's start with a sparse matrix from the Matrix package:

library(Matrix)
library(SparseM)

# small 8 x 10 example with 7 non-zero entries
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)

The Matrix package stores this matrix in a column-oriented compressed format (class dgCMatrix), while SparseM supports both column- and row-oriented formats and has functions that can easily handle the conversion from one format to the other.
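
To see what column-oriented compression looks like in practice, the sketch below inspects the three slots of the toy matrix (rebuilt here so the snippet runs on its own). The slot names `x`, `i` and `p` are those of the `dgCMatrix` class.

```r
library(Matrix)

# Rebuild the toy matrix from above (8 rows, 10 columns, 7 non-zeros)
X <- sparseMatrix(i = c(1, 3:8), j = c(2, 9, 6:10), x = 7 * (1:7))

X@x  # non-zero values, column by column:        7 21 28 35 14 42 49
X@i  # 0-based row index of each value:          0 3 4 5 2 6 7
X@p  # 0-based column pointers (length ncol + 1): 0 0 1 1 1 1 2 3 4 6 7

# Consecutive differences of p give the number of non-zeros per column
diff(X@p)  # 0 1 0 0 0 1 1 1 2 1
```

Both `i` and `p` start counting at 0, which is why they need shifting before being handed to SparseM.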

So we will first convert our column-oriented Matrix object into a column-oriented SparseM matrix. We just need to call the right constructor and account for the two packages' different index conventions: Matrix stores 0-based indices in its i and p slots, while SparseM expects 1-based indices, hence the + 1L below:

X.csc <- new("matrix.csc", ra = X@x,
                           ja = X@i + 1L,
                           ia = X@p + 1L,
                           dimension = X@Dim)

Then, change from column-oriented to row-oriented format:

X.csr <- as.matrix.csr(X.csc)
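
For reuse on the real matrix, the two steps can be wrapped in a small helper. This is a minimal sketch: the name `dgc_to_csr` is my own (hypothetical), and it assumes the input is a `dgCMatrix`, so check `class(X)` on your own object first.

```r
library(Matrix)
library(SparseM)

# Hypothetical helper: Matrix dgCMatrix -> SparseM matrix.csr,
# built from the sparse slots only.
dgc_to_csr <- function(X) {
  stopifnot(inherits(X, "dgCMatrix"))
  # Same slots, shifted from 0-based (Matrix) to 1-based (SparseM) indices
  X.csc <- new("matrix.csc",
               ra = X@x,
               ja = X@i + 1L,
               ia = X@p + 1L,
               dimension = X@Dim)
  # Column-oriented -> row-oriented, handled by SparseM
  as.matrix.csr(X.csc)
}

# Usage on the toy matrix from above
X <- sparseMatrix(i = c(1, 3:8), j = c(2, 9, 6:10), x = 7 * (1:7))
X.csr <- dgc_to_csr(X)
X.csr@dimension  # 8 10
```

The result is the SparseM object the question wants to pass on to write.matrix.csr() from e1071.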

And you're done! You can check that the two matrices are identical (on my small example) by doing:

range(as.matrix(X) - as.matrix(X.csr))
# [1] 0 0
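
On a matrix as large as the one in the question, as.matrix() would build the dense copy we are trying to avoid. A possible alternative, sketched here with illustrative variable names, is to compare the (row, column, value) triplets read directly from the slots of both objects:

```r
library(Matrix)
library(SparseM)

# Toy matrix and its SparseM counterpart, as above
X <- sparseMatrix(i = c(1, 3:8), j = c(2, 9, 6:10), x = 7 * (1:7))
X.csc <- new("matrix.csc", ra = X@x, ja = X@i + 1L, ia = X@p + 1L,
             dimension = X@Dim)
X.csr <- as.matrix.csr(X.csc)

# Triplets from the Matrix object (CSC: p delimits columns, i holds rows)
m_row <- X@i + 1L
m_col <- rep(seq_len(ncol(X)), diff(X@p))
m_val <- X@x

# Triplets from the SparseM object (CSR: ia delimits rows, ja holds columns)
s_row <- rep(seq_len(X.csr@dimension[1]), diff(X.csr@ia))
s_col <- X.csr@ja
s_val <- X.csr@ra

# Put both in the same (column, row) order before comparing
om <- order(m_col, m_row)
os <- order(s_col, s_row)
same <- all(X.csr@dimension == dim(X)) &&
  length(s_val) == length(m_val) &&
  all(s_row[os] == m_row[om]) &&
  all(s_col[os] == m_col[om]) &&
  all(s_val[os] == m_val[om])
same  # TRUE if the conversion preserved every entry
```

This only touches vectors whose length is the number of non-zero entries, so memory use stays proportional to the sparse representation.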
flodel answered Sep 26 '22 14:09