I have a data frame that is mostly zeros (a sparse data frame?), something like this:
name,factor_1,factor_2,factor_3
ABC,1,0,0
DEF,0,1,0
GHI,0,0,1
The actual data is about 90,000 rows with 10,000 features. Can I convert this to a sparse matrix? I am expecting to gain time and space efficiencies by utilizing a sparse matrix instead of a data frame.
Any help would be appreciated.
Update #1: Here is some code to generate the data frame. Thanks to Richard for providing this.
x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF", "GHI"),
class = "factor"),
factor_1 = c(1L, 0L, 0L),
factor_2 = c(0L,1L, 0L),
factor_3 = c(0L, 0L, 1L)),
.Names = c("name", "factor_1","factor_2", "factor_3"),
class = "data.frame",
row.names = c(NA,-3L))
Convert a data frame into a numeric matrix in R with data.matrix(). The data.matrix() function converts all the values of a data frame to numeric mode and then binds them together as a matrix.
Converting a data frame to a sparse matrix: a data frame is a two-dimensional, table-like structure with rows and columns, and is the most common way of storing data in R. It can be converted to a sparse matrix with the sparseMatrix() function from the Matrix package, which builds the matrix from the row and column indices of the non-zero entries.
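As a minimal sketch of the sparseMatrix() route, the code below collects the non-zero entries of the question's toy data frame column by column and passes them to sparseMatrix(). The helper names (feature_cols, nz_rows, sparse_x) are made up for illustration, and the column-by-column scan is one possible way to get the indices, not a method taken from the answers here.

```r
library(Matrix)

# Toy data frame from the question: a name column plus 0/1 feature columns
x <- data.frame(
  name     = c("ABC", "DEF", "GHI"),
  factor_1 = c(1L, 0L, 0L),
  factor_2 = c(0L, 1L, 0L),
  factor_3 = c(0L, 0L, 1L)
)

feature_cols <- setdiff(names(x), "name")   # hypothetical helper name

# Locate the non-zero entries one column at a time, so no dense matrix is built
nz_rows <- lapply(x[feature_cols], function(col) which(col != 0))
i <- unlist(nz_rows, use.names = FALSE)           # row index of each non-zero
j <- rep(seq_along(nz_rows), lengths(nz_rows))    # column index of each non-zero
v <- unlist(Map(function(col, idx) col[idx], x[feature_cols], nz_rows),
            use.names = FALSE)                    # the non-zero values themselves

sparse_x <- sparseMatrix(
  i = i, j = j, x = as.numeric(v),
  dims = c(nrow(x), length(feature_cols)),
  dimnames = list(as.character(x$name), feature_cols)
)
```

Because only the non-zero positions are ever materialised, this approach should scale with the number of non-zeros rather than with rows times columns, although the per-column loop adds some overhead.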
One way to save a sparse matrix is as an .mtx file, which stores the matrix in MatrixMarket format. The writeMM() function from the Matrix package writes a sparse matrix object to such a file.
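A short sketch of the save-and-reload round trip, assuming the Matrix package is installed. The matrix m and the temporary file path are placeholders for illustration. Note that the MatrixMarket format stores only the entries and dimensions, so row and column names would need to be saved separately.

```r
library(Matrix)

# Small stand-in sparse matrix (3 x 3 with ones on the diagonal)
m <- sparseMatrix(i = 1:3, j = 1:3, x = 1, dims = c(3, 3))

# Write to a temporary .mtx file in MatrixMarket format
path <- tempfile(fileext = ".mtx")
writeMM(m, file = path)

# Read it back; readMM() returns a sparse matrix in triplet form,
# so convert to column-compressed form if that is what downstream code expects
m2 <- as(readMM(path), "CsparseMatrix")
```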
I do this all the time and it's a pain in the butt, so I wrote a method for it called sparsify() in my R package, mltools. It operates on data.tables, which are just fancy data.frames.
To solve your specific problem...
Install mltools (or just copy the sparsify() method into your environment)
Load packages
library(data.table)
library(Matrix)
library(mltools)
Sparsify
x <- data.table(x) # convert x to a data.table
sparseM <- sparsify(x[, !"name"]) # sparsify everything except the name column
rownames(sparseM) <- x$name # set the rownames
> sparseM
3 x 3 sparse Matrix of class "dgCMatrix"
factor_1 factor_2 factor_3
ABC 1 . .
DEF . 1 .
GHI . . 1
In general, the sparsify() method is pretty flexible. Here are some examples of how you can use it:
Make some data. Notice the data types and the unused factor levels.
dt <- data.table(
intCol=c(1L, NA_integer_, 3L, 0L),
realCol=c(NA, 2, NA, NA),
logCol=c(TRUE, FALSE, TRUE, FALSE),
ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
> dt
intCol realCol logCol ofCol ufCol
1: 1 NA TRUE a a
2: NA 2 FALSE b NA
3: 3 NA TRUE NA c
4: 0 NA FALSE b b
Out-Of-The-Box Use
> sparsify(dt)
4 x 7 sparse Matrix of class "dgCMatrix"
intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,] 1 NA 1 1 1 . .
[2,] NA 2 . 2 NA NA NA
[3,] 3 NA 1 NA . . 1
[4,] . NA . 2 . 1 .
Convert NAs to 0s and Sparsify Them
> sparsify(dt, sparsifyNAs=TRUE)
4 x 7 sparse Matrix of class "dgCMatrix"
intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,] 1 . 1 1 1 . .
[2,] . 2 . 2 . . .
[3,] 3 . 1 . . . 1
[4,] . . . 2 . 1 .
Generate Columns That Identify NA Values
> sparsify(dt[, list(realCol)], naCols="identify")
4 x 2 sparse Matrix of class "dgCMatrix"
realCol_NA realCol
[1,] 1 NA
[2,] . 2
[3,] 1 NA
[4,] 1 NA
Generate Columns That Identify NA Values In the Most Memory Efficient Manner
> sparsify(dt[, list(realCol)], naCols="efficient")
4 x 2 sparse Matrix of class "dgCMatrix"
realCol_NotNA realCol
[1,] . NA
[2,] 1 2
[3,] . NA
[4,] . NA
It might be a bit more memory efficient (but slower) to avoid copying all the data into a dense matrix:
y <- Reduce(cbind2, lapply(x[,-1], Matrix, sparse = TRUE))
rownames(y) <- as.character(x[,1]) # name is a factor; pass plain character labels
#3 x 3 sparse Matrix of class "dgCMatrix"
#
#ABC 1 . .
#DEF . 1 .
#GHI . . 1
If you have sufficient memory, you should use Richard's answer, i.e., turn your data.frame into a dense matrix and then use Matrix.
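To get a rough feel for the space trade-off between the dense and sparse representations, one option is to compare object.size() on a synthetic matrix. The sketch below uses made-up dimensions (far smaller than the 90,000 x 10,000 data in the question) and one non-zero per row, which is an assumption about the data's shape, so treat the result as indicative only.

```r
library(Matrix)

set.seed(1)
n_rows <- 2000   # illustrative sizes, not the real data dimensions
n_cols <- 500

# Dense matrix of zeros with a single 1 placed at random in every row
dense <- matrix(0, nrow = n_rows, ncol = n_cols)
dense[cbind(seq_len(n_rows), sample(n_cols, n_rows, replace = TRUE))] <- 1

# Same content stored as a sparse matrix
sparse <- Matrix(dense, sparse = TRUE)

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

# Ratio of dense to sparse memory footprint
dense_bytes / sparse_bytes
```

The dense matrix still has to fit in memory once before conversion, which is the cost the column-by-column approach above tries to avoid.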
You could make the first column into row names, then use Matrix from the Matrix package.
rownames(x) <- x$name
x <- x[-1]
library(Matrix)
Matrix(as.matrix(x), sparse = TRUE)
# 3 x 3 sparse Matrix of class "dtCMatrix"
# factor_1 factor_2 factor_3
# ABC 1 . .
# DEF . 1 .
# GHI . . 1
where the original x data frame is
x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF",
"GHI"), class = "factor"), factor_1 = c(1L, 0L, 0L), factor_2 = c(0L,
1L, 0L), factor_3 = c(0L, 0L, 1L)), .Names = c("name", "factor_1",
"factor_2", "factor_3"), class = "data.frame", row.names = c(NA,
-3L))