Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating dummy variables in R data.table

I am working with an extremely large dataset in R and have been operating with data frames and have decided to switch to data.tables to help speed up with operations. I am having trouble understanding the J operations, in particular I'm trying to generate dummy variables but I can't figure out how to code conditional operations within data.tables[].

MWE:

test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))

What I would like to do is to add columns a through j as dummy variables such that column a would have a value 1 when the index == "a" and 0 otherwise. In the data.frame environment it would look something like:

test$a <- 0

test$a[test$index=='a'] <- 1
like image 660
user2792957 Avatar asked Sep 18 '13 19:09

user2792957


People also ask

How do you make a variable dummy in R?

Using dummy_cols() function It creates dummy variables on the basis of parameters provided in the function. If columns are not selected in the function call for which dummy variable has to be created, then dummy variables are created for all characters and factors column in the dataframe.

Does R automatically create dummy variables?

This recoding is called “dummy coding” and leads to the creation of a table called contrast matrix. This is done automatically by statistical software, such as R. Here, you'll learn how to build and interpret a linear regression model with categorical predictor variables.

How do I convert categorical data to dummy variables in R?

To convert category variables to dummy variables in tidyverse, use the spread() method. To do so, use the spread() function with three arguments: key, which is the column to convert into categorical values, in this case, “Reporting Airline”; value, which is the value you want to set the key to (in this case “dummy”);

How do you create a set of dummy variables?

There are two steps to successfully set up dummy variables in a multiple regression: (1) create dummy variables that represent the categories of your categorical independent variable; and (2) enter values into these dummy variables – known as dummy coding – to represent the categories of the categorical independent ...


1 Answers

This seems to do what you're looking for:

inds <- unique(test$index)
test[, (inds) := lapply(inds, function(x) index == x)]

which gives

      index        var1     a     b     c     d     e     f     g     h     i     j
   1:     a  0.25331851  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   2:     b -0.02854676 FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   3:     c -0.04287046 FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   4:     d  1.36860228 FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
   5:     e -0.22577099 FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
  ---                                                                              
 996:     f -1.02040059 FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 997:     g -1.31345092 FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 998:     h -0.49448088 FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 999:     i  1.75175715 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
1000:     j  0.05576477 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Here's another way:

dcast(test, index + var1 ~ index, fun = length)
# or, if you want to preserve row order
dcast(test[, r := .I], r + index + var1 ~ index, fun = length)[, r := NULL]

And another:

rs = split(seq(nrow(test)), test$index)
test[, names(rs) := FALSE ]
for (n in names(rs)) set(test, i = rs[[n]], j = n, v = TRUE )
like image 75
Frank Avatar answered Oct 06 '22 02:10

Frank