Suppose you have a data frame with a large number of columns (1000 factors, each with 15 levels). You'd like to create a dummy variable data set, but since it would be very sparse, you would like to keep the dummies in sparse matrix format.
My data set is quite big, and the fewer steps there are, the better for me. I know how to do the steps above, but I couldn't get my head around creating that sparse matrix directly from the initial data set, i.e. having one step instead of two. Any ideas?
EDIT: Some comments asked for further elaboration, so here it goes:
Here X is my original data set with 1000 columns and 50000 records, each column having 15 levels.
Step 1: Create dummy variables from the original data set, with code like:
# Number of levels per column (15 for every column in this data set)
l <- 15
# Create the dummy matrix, initially filled with NA
dummified <- matrix(NA, nrow(X), l * ncol(X))
# Fill in one 0/1 column for each level of each original column
for (i in seq_len(ncol(X))) {
  colFactr <- factor(X[, i], exclude = NULL)
  for (j in seq_len(l)) {
    lvl <- levels(colFactr)[j]
    indx <- (i - 1) * l + j
    dummified[, indx] <- ifelse(colFactr == lvl, 1, 0)
  }
}
Step 2: Convert that huge matrix into a sparse matrix, with code like:
library(Matrix)
sparse.dummified <- Matrix(dummified, sparse = TRUE)
But this approach still creates the large intermediate matrix, which takes a lot of time and memory, so I am asking whether there is a direct method.
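For a sense of scale, here is a rough back-of-the-envelope estimate of the two representations. It is only a sketch: it assumes double-precision values, no missing values (so exactly one non-zero per row per factor), and the usual compressed column layout of 8 bytes per stored value plus 4 bytes per row index. The variable names are illustrative.

```r
n_rows <- 50000
n_cols <- 1000 * 15        # 1000 factors x 15 levels
nnz    <- n_rows * 1000    # one non-zero per row per factor

# Dense matrix of doubles: 8 bytes per cell
dense_bytes <- n_rows * n_cols * 8

# Compressed sparse column storage: value + row index per non-zero,
# plus one 4-byte column pointer per column (and one extra)
sparse_bytes <- nnz * (8 + 4) + (n_cols + 1) * 4

dense_bytes / 1e9    # roughly 6 GB
sparse_bytes / 1e9   # roughly 0.6 GB
```

Under these assumptions the intermediate dense matrix is about ten times larger than the sparse result it is converted into.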
Thanks for clarifying your question. Try this.
Here is sample data with two columns that have three and two levels respectively:
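set.seed(123)
n <- 6
# stringsAsFactors = TRUE is needed since R 4.0.0, where it is no longer the default
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"), n, TRUE),
                 stringsAsFactors = TRUE)
df
# (output from the original answer; R >= 3.6.0 draws a different sample for this seed)
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D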
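library(Matrix)
# One sparse indicator block per column: row number as i, level code as j
spm <- lapply(df, function(f) sparseMatrix(i = seq_along(f),
                                           j = as.integer(f), x = 1,
                                           dims = c(length(f), nlevels(f))))
# cBind() no longer works in recent versions of Matrix; plain cbind() handles sparse matrices
do.call(cbind, spm)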
# 6 x 5 sparse Matrix of class "dgCMatrix"
#
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .
Edit: @user20650 pointed out that the do.call() step above was sluggish or failing with large data. So here is a more complex but much faster and more efficient approach:
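n <- nrow(df)
nlev <- sapply(df, nlevels)        # number of levels per column
i <- rep(seq_len(n), ncol(df))     # row index, repeated once per column
# Column index: level code within each variable, shifted by the number of
# dummy columns taken up by the preceding variables
j <- unlist(lapply(df, as.integer)) +
  rep(cumsum(c(0, head(nlev, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x, dims = c(n, sum(nlev)))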
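If it helps, the same idea can be wrapped in a small function. This is only a sketch: the name sparse_dummies, the "variable_level" column naming, and the choice to leave missing values as all-zero entries are my own assumptions, not something the Matrix package provides.

```r
library(Matrix)

# Hypothetical helper: one-hot encode every column of `df` directly into
# a sparse matrix, without building a dense intermediate.
sparse_dummies <- function(df) {
  # Coerce non-factor columns so that level codes are defined
  df[] <- lapply(df, function(col) if (is.factor(col)) col else factor(col))
  n    <- nrow(df)
  nlev <- unname(vapply(df, nlevels, integer(1)))
  # Starting offset of each variable's block of dummy columns
  offset <- cumsum(c(0L, head(nlev, -1)))
  i <- rep(seq_len(n), times = ncol(df))
  j <- unlist(lapply(df, as.integer), use.names = FALSE) +
    rep(offset, each = n)
  keep <- !is.na(j)  # a missing value contributes no non-zero entry
  # Column names of the form "variable_level"
  cn <- unlist(Map(function(nm, col) sprintf("%s_%s", nm, levels(col)),
                   names(df), df), use.names = FALSE)
  sparseMatrix(i = i[keep], j = j[keep], x = 1,
               dims = c(n, sum(nlev)),
               dimnames = list(NULL, cn))
}

# Small check against the dense two-step route, using fixed data
df <- data.frame(x = c("A", "C", "B", "C", "C", "A"),
                 y = c("E", "E", "E", "D", "E", "D"),
                 stringsAsFactors = TRUE)
m <- sparse_dummies(df)
dense <- cbind(model.matrix(~ x - 1, df), model.matrix(~ y - 1, df))
all(as.matrix(m) == dense)
```

Passing dims explicitly keeps the column count right even when the last level of a factor never occurs in the data, which the index-only call would silently drop.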