R: sparse matrix conversion

Tags:

r

sparse-matrix

I have a matrix of factors in R and want to convert it to a matrix of 0-1 dummy variables covering all possible levels of each factor.

However, this "dummy" matrix is very large (91690x16593) and very sparse. I need to store it as a sparse matrix, otherwise it does not fit in my 12 GB of RAM.

Currently, I am using the following code, which works fine and takes only seconds:

library(Matrix)
X_factors <- data.frame(lapply(my_matrix, as.factor))
#encode factor data in a sparse matrix
X <- sparse.model.matrix(~.-1, data = X_factors)

However, I want to use the e1071 package in R, and eventually save this matrix to libsvm format with write.matrix.csr(), so first I need to convert my sparse matrix to the SparseM format.

I tried to do:

library(SparseM)
X2 <- as.matrix.csr(X)

but it very quickly fills my RAM and eventually R crashes. I suspect that internally, as.matrix.csr first converts the sparse matrix to a dense matrix, which does not fit in my computer's memory.

My other alternative would be to create my sparse matrix directly in the SparseM format.
I tried as.matrix.csr(X_factors), but it does not accept a data frame of factors.

Is there an equivalent to sparse.model.matrix(~.-1, data = X_factors) in the SparseM package? I searched the documentation but did not find one.


asked Jun 28 '13 23:06

Benoit_Plante

1 Answer

Quite tricky but I think I got it.

Let's start with a sparse matrix from the Matrix package:

library(Matrix)
i <- c(1, 3:8)
j <- c(2, 9, 6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)  # 8 x 10 sparse matrix with 7 non-zero entries

The Matrix package uses a column-oriented compression format, while SparseM supports both column- and row-oriented formats and has functions that convert easily from one to the other.
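To make the column-oriented layout concrete, here is a small sketch that inspects the slots of the example matrix. It assumes X ends up as a dgCMatrix, which is what sparseMatrix() produces for numeric values; the slot names x, i and p are the ones used in the conversion later in this answer.

```r
library(Matrix)

# Same small example as above.
i <- c(1, 3:8)
j <- c(2, 9, 6:10)
x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)

is(X, "dgCMatrix")  # column-compressed storage with numeric values
X@x                 # non-zero values, stored column by column
X@i                 # 0-based row index of each stored value
X@p                 # 0-based column pointers, length ncol(X) + 1
X@Dim               # number of rows and columns
```

Only these short vectors are stored, which is why the matrix stays small in memory as long as it is never expanded to a dense one.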

So we will first convert our column-oriented Matrix object into a column-oriented SparseM matrix. We just need to call the right constructor and account for the different index conventions of the two packages: the Matrix slots are 0-based, while SparseM expects 1-based indices:

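library(SparseM)
X.csc <- new("matrix.csc", ra = X@x,          # non-zero values
                           ja = X@i + 1L,     # row indices, 0-based -> 1-based
                           ia = X@p + 1L,     # column pointers, 0-based -> 1-based
                           dimension = X@Dim)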

Then, change from column-oriented to row-oriented format:

X.csr <- as.matrix.csr(X.csc)
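The two steps can be wrapped in one function and applied to the output of sparse.model.matrix(). This is only a sketch: dgc_to_csr is a hypothetical name, not a function from Matrix or SparseM, and the data frame is a toy stand-in for the X_factors of the question.

```r
library(Matrix)
library(SparseM)

# Hypothetical helper (not part of either package): convert a Matrix
# dgCMatrix to a SparseM matrix.csr without building a dense matrix.
dgc_to_csr <- function(X) {
  stopifnot(is(X, "dgCMatrix"))
  X.csc <- new("matrix.csc",
               ra = X@x,          # non-zero values
               ja = X@i + 1L,     # row indices, shifted to 1-based
               ia = X@p + 1L,     # column pointers, shifted to 1-based
               dimension = X@Dim)
  as.matrix.csr(X.csc)            # column-oriented -> row-oriented
}

# Toy stand-in for the data frame of factors in the question.
X_factors <- data.frame(a = factor(c("u", "v", "u", "w")),
                        b = factor(c("p", "p", "q", "q")))
X  <- sparse.model.matrix(~ . - 1, data = X_factors)
X2 <- dgc_to_csr(X)
dim(X2)
```

The resulting X2 is the kind of object that write.matrix.csr() from e1071 expects. Note that the SparseM classes have no slot for dimnames, so the column names produced by sparse.model.matrix() are not carried over.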

And you're done! You can check that the conversion preserved the values by comparing dense copies of the two matrices. This builds full dense matrices, so only do it on a small example like mine:

range(as.matrix(X) - as.matrix(X.csc))
# [1] 0 0
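For a matrix of the size in the question, a dense comparison would run into the same memory problem. A rough sanity check that stays sparse could compare the slots directly. This is a sketch, not a full proof of equality: it checks the dimensions, the number and multiset of stored values, and the number of stored values per row.

```r
library(Matrix)
library(SparseM)

# Rebuild the small example and convert it as above.
X <- sparseMatrix(c(1, 3:8), c(2, 9, 6:10), x = 7 * (1:7))
X.csc <- new("matrix.csc", ra = X@x, ja = X@i + 1L, ia = X@p + 1L,
             dimension = X@Dim)
X.csr <- as.matrix.csr(X.csc)

# Compare without ever creating a dense matrix.
same_dim    <- all(X.csr@dimension == dim(X))
same_count  <- length(X.csr@ra) == length(X@x)
same_values <- isTRUE(all.equal(sort(X.csr@ra), sort(X@x)))
# In row-oriented storage, diff(ia) is the number of stored values per row.
same_rows   <- all(diff(X.csr@ia) == tabulate(X@i + 1L, nbins = nrow(X)))

all(same_dim, same_count, same_values, same_rows)
```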


answered Sep 26 '22 14:09

flodel
