Directly creating dummy variable set in a sparse matrix in R

Suppose you have a data frame with a large number of columns (1000 factors, each with 15 levels). You'd like to create a dummy variable data set, but since it would be very sparse, you would like to keep the dummies in sparse matrix format.

My data set is quite big, so the fewer steps there are, the better. I know how to do this in two steps (build the dummy variables, then convert them to a sparse matrix), but I can't figure out how to create the sparse matrix directly from the initial data set, i.e. in one step instead of two. Any ideas?

EDIT: Some comments asked for further elaboration, so here it goes:

Here X is my original data set with 1000 columns and 50000 records, each column having 15 levels.

Step 1: Create dummy variables from the original data set with code like:

# Number of levels per column
l <- 15
# Create the dummy matrix, initially filled with NA
dummified <- matrix(NA, nrow(X), l * ncol(X))
# Fill in one indicator column per level of each original column
for (i in 1:ncol(X)) {
  colFactr <- factor(X[, i], exclude = NULL)
  for (j in 1:l) {
    lvl <- levels(colFactr)[j]
    indx <- ((i - 1) * l) + j
    dummified[, indx] <- ifelse(colFactr == lvl, 1, 0)
  }
}

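Step 2: Convert that huge matrix into a sparse matrix with code like:

library(Matrix)
# sparseMatrix() expects i/j/x triplets, so a dense matrix is converted with Matrix()
sparse.dummified <- Matrix(dummified, sparse = TRUE)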

But this approach still creates the large intermediate matrix, which takes a lot of time and memory, so I am asking whether there is a direct method.
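For a sense of scale, here is a rough back-of-the-envelope estimate of the two representations. It is only a sketch: it assumes 8-byte doubles for the dense matrix and compressed sparse column storage (an 8-byte value and a 4-byte row index per non-zero, plus a 4-byte pointer per column) for the sparse one, and it ignores object overhead.

```r
# Dimensions as stated in the question
n_rows   <- 50000
n_cols   <- 1000
n_levels <- 15

# Dense numeric matrix: 8 bytes per cell
dense_bytes <- n_rows * (n_cols * n_levels) * 8
dense_bytes / 1e9    # 6 (GB)

# Sparse storage: each record contributes one non-zero per original column
nnz          <- n_rows * n_cols
sparse_bytes <- nnz * (8 + 4) + (n_cols * n_levels + 1) * 4
sparse_bytes / 1e9   # about 0.6 (GB)
```

Under these assumptions the intermediate dense matrix is roughly ten times larger than the sparse result it is converted into.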


asked Apr 12 '14 20:04

agondiken

1 Answer

Thanks for clarifying your question. Try this.

Here is sample data with two columns that have three and two levels respectively:

set.seed(123)
n <- 6
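# stringsAsFactors = TRUE is needed on R >= 4.0.0, where strings are no longer
# converted to factors by default; the code below relies on factor codes
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE),
                 stringsAsFactors = TRUE)
# Output from the original answer; sample() changed in R 3.6.0, so newer
# versions may draw different rows for the same seed
#   x y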
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D

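library(Matrix)
# One sparse indicator matrix per column; dims keeps a column for every level,
# even one that happens not to occur in the data
spm <- lapply(df, function(j) sparseMatrix(i = seq_along(j),
                                           j = as.integer(j), x = 1,
                                           dims = c(length(j), nlevels(j))))
# cBind() is deprecated in Matrix (defunct in recent releases);
# cbind() works on sparse matrices since R 3.2.0
do.call(cbind, spm)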
# 6 x 5 sparse Matrix of class "dgCMatrix"
#               
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .

Edit: @user20650 pointed out that the do.call() column-binding step above was sluggish or failed with large data. Here is a more complex but much faster approach that builds the whole matrix in a single sparseMatrix() call:

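n <- nrow(df)
nlevels <- sapply(df, nlevels)          # number of levels in each column
i <- rep(seq_len(n), ncol(df))          # row index, repeated once per column
# Column index: the factor code plus an offset equal to the total number of
# levels in all preceding columns
j <- unlist(lapply(df, as.integer)) +
     rep(cumsum(c(0, head(nlevels, -1))), each = n)
x <- 1
# dims keeps a column for every level, even one that does not occur in the data
sparseMatrix(i = i, j = j, x = x, dims = c(n, sum(nlevels)))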
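To make the column offsets concrete, here is a small sketch that wraps the single-call approach in a function. The function name `sparse_dummies`, the "<column>.<level>" column naming, and the choice to leave an NA entry as all zeros are my assumptions, not part of the answer above. The sample values are typed in by hand so the result does not depend on the random number generator.

```r
library(Matrix)

# Hypothetical helper: build the dummy matrix in one sparseMatrix() call
sparse_dummies <- function(df) {
  stopifnot(all(vapply(df, is.factor, logical(1))))
  n    <- nrow(df)
  nlev <- vapply(df, nlevels, integer(1))
  # Offset of each column's block, e.g. levels c(3, 2) give offsets c(0, 3)
  offsets <- cumsum(c(0L, head(nlev, -1)))
  i <- rep(seq_len(n), times = ncol(df))
  j <- unlist(lapply(df, as.integer), use.names = FALSE) +
       rep(offsets, each = n)
  # Assumption: an NA value gets no indicator at all (all-zero block)
  keep <- !is.na(j)
  col_names <- unlist(Map(function(nm, f) paste(nm, levels(f), sep = "."),
                          names(df), df), use.names = FALSE)
  sparseMatrix(i = i[keep], j = j[keep], x = 1,
               dims = c(n, sum(nlev)),
               dimnames = list(NULL, col_names))
}

# Same values as the sample data above, entered explicitly
df <- data.frame(
  x = factor(c("A", "C", "B", "C", "C", "A"), levels = c("A", "B", "C")),
  y = factor(c("E", "E", "E", "D", "E", "D"), levels = c("D", "E"))
)
m <- sparse_dummies(df)
dim(m)        # 6 5
colnames(m)   # "x.A" "x.B" "x.C" "y.D" "y.E"
```

Every row has exactly one 1 per original column, so rowSums(m) equals ncol(df) for complete rows, which makes a cheap sanity check on large inputs.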


answered Nov 15 '22 13:11

flodel

Tags: r, matrix, r-factor, sparse-matrix
