
R - data frame - convert to sparse matrix

I have a data frame that is mostly zeros (a sparse data frame?), something like this:

name,factor_1,factor_2,factor_3
ABC,1,0,0
DEF,0,1,0
GHI,0,0,1

The actual data is about 90,000 rows with 10,000 features. Can I convert this to a sparse matrix? I am expecting to gain time and space efficiencies by utilizing a sparse matrix instead of a data frame.

Any help would be appreciated

Update #1: Here is some code to generate the data frame. Thanks Richard for providing this

x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF", "GHI"),
                    class = "factor"), 
               factor_1 = c(1L, 0L, 0L), 
               factor_2 = c(0L,1L, 0L), 
               factor_3 = c(0L, 0L, 1L)), 
               .Names = c("name", "factor_1","factor_2", "factor_3"), 
               class = "data.frame",
               row.names = c(NA,-3L))
Abhi asked Nov 19 '14 03:11



3 Answers

I do this all the time and it's a pain in the butt, so I wrote a method for it called sparsify() in my R package - mltools. It operates on data.tables which are just fancy data.frames.


To solve your specific problem...

Install mltools (or just copy the sparsify() method into your environment)

Load packages

library(data.table)
library(Matrix)
library(mltools)

Sparsify

x <- data.table(x)  # convert x to a data.table
sparseM <- sparsify(x[, !"name"])  # sparsify everything except the name column
rownames(sparseM) <- x$name  # set the rownames

> sparseM
3 x 3 sparse Matrix of class "dgCMatrix"
    factor_1 factor_2 factor_3
ABC        1        .        .
DEF        .        1        .
GHI        .        .        1

In general, the sparsify() method is pretty flexible. Here are some examples of how you can use it:

Make some data. Notice data types and unused factor levels

dt <- data.table(
  intCol=c(1L, NA_integer_, 3L, 0L),
  realCol=c(NA, 2, NA, NA),
  logCol=c(TRUE, FALSE, TRUE, FALSE),
  ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
  ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
> dt
   intCol realCol logCol ofCol ufCol
1:      1      NA   TRUE     a     a
2:     NA       2  FALSE     b    NA
3:      3      NA   TRUE    NA     c
4:      0      NA  FALSE     b     b

Out-Of-The-Box Use

> sparsify(dt)
4 x 7 sparse Matrix of class "dgCMatrix"
     intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,]      1      NA      1     1       1       .       .
[2,]     NA       2      .     2      NA      NA      NA
[3,]      3      NA      1    NA       .       .       1
[4,]      .      NA      .     2       .       1       .

Convert NAs to 0s and Sparsify Them

> sparsify(dt, sparsifyNAs=TRUE)
4 x 7 sparse Matrix of class "dgCMatrix"
     intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,]      1       .      1     1       1       .       .
[2,]      .       2      .     2       .       .       .
[3,]      3       .      1     .       .       .       1
[4,]      .       .      .     2       .       1       .

Generate Columns That Identify NA Values

> sparsify(dt[, list(realCol)], naCols="identify")
4 x 2 sparse Matrix of class "dgCMatrix"
     realCol_NA realCol
[1,]          1      NA
[2,]          .       2
[3,]          1      NA
[4,]          1      NA

Generate Columns That Identify NA Values In the Most Memory Efficient Manner

> sparsify(dt[, list(realCol)], naCols="efficient")
4 x 2 sparse Matrix of class "dgCMatrix"
     realCol_NotNA realCol
[1,]             .      NA
[2,]             1       2
[3,]             .      NA
[4,]             .      NA
Ben answered Oct 16 '22 17:10


It might be a bit more memory efficient (but slower) to avoid copying all the data into a dense matrix:

y <- Reduce(cbind2, lapply(x[,-1], Matrix, sparse = TRUE))
rownames(y) <- x[,1]

#3 x 3 sparse Matrix of class "dgCMatrix"
#         
#ABC 1 . .
#DEF . 1 .
#GHI . . 1

If you have sufficient memory you should use Richard's answer, i.e., turn your data.frame into a dense matrix and then use Matrix.
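
Another way to skip the dense intermediate, sketched here as an alternative rather than as part of the approach above, is to collect the positions of the non-zero entries column by column and hand them to sparseMatrix() from the Matrix package. The small data frame and the helper names (feat, nz, i, j, v) are only for illustration, and the sketch assumes every feature column is numeric:

```r
library(Matrix)

# Small stand-in for the question's data frame (names kept as character)
x <- data.frame(
  name     = c("ABC", "DEF", "GHI"),
  factor_1 = c(1L, 0L, 0L),
  factor_2 = c(0L, 1L, 0L),
  factor_3 = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

feat <- x[-1]  # feature columns only

# For each column, the row positions of its non-zero entries
nz <- lapply(feat, function(col) which(col != 0))

i <- unlist(nz, use.names = FALSE)      # row indices of the non-zeros
j <- rep(seq_along(nz), lengths(nz))    # matching column indices
v <- unlist(Map(function(col, idx) col[idx], feat, nz), use.names = FALSE)

# Build the sparse matrix directly from (row, column, value) triplets
y <- sparseMatrix(
  i = i, j = j, x = as.numeric(v),
  dims = c(nrow(feat), ncol(feat)),
  dimnames = list(as.character(x$name), names(feat))
)
y
```

Only one column is inspected at a time, so peak memory should stay close to the size of the data frame plus the non-zero entries. This version also keeps the column names, which the cbind2 approach drops.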

Roland answered Oct 16 '22 17:10


You could make the first column into row names, then use Matrix from the Matrix package.

rownames(x) <- x$name
x <- x[-1]
library(Matrix)
Matrix(as.matrix(x), sparse = TRUE)
# 3 x 3 sparse Matrix of class "dtCMatrix"
#     factor_1 factor_2 factor_3
# ABC        1        .        .
# DEF        .        1        .
# GHI        .        .        1

where the original x data frame is

x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF", 
"GHI"), class = "factor"), factor_1 = c(1L, 0L, 0L), factor_2 = c(0L, 
1L, 0L), factor_3 = c(0L, 0L, 1L)), .Names = c("name", "factor_1", 
"factor_2", "factor_3"), class = "data.frame", row.names = c(NA, 
-3L))
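
To get a rough feel for the space savings the question asks about, here is a minimal sketch on simulated data. The dimensions and the roughly 1% fill rate are assumptions chosen so it runs quickly; they are not the asker's real data, and the names dense, df and sp are just for illustration:

```r
library(Matrix)

set.seed(1)
n_row <- 2000
n_col <- 500

# Hypothetical data: about 1% of the entries are non-zero
dense <- matrix(rbinom(n_row * n_col, size = 1, prob = 0.01), nrow = n_row)
df <- as.data.frame(dense)

# Same conversion as above: dense matrix -> sparse Matrix
sp <- Matrix(dense, sparse = TRUE)

# Compare the in-memory footprint of the two representations
size_df <- object.size(df)
size_sp <- object.size(sp)
print(size_df, units = "MB")
print(size_sp, units = "MB")
```

The sparse object stores only the non-zero values and their positions, so the gap grows as the data gets sparser. Whether it also saves time depends on which operations you run on it afterwards.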
Rich Scriven answered Oct 16 '22 18:10