I am working with an extremely large dataset in R. I have been using data frames and have decided to switch to data.table to speed up operations. I am having trouble understanding the j argument; in particular, I am trying to generate dummy variables, but I can't figure out how to code conditional operations inside data.table's [ ].
MWE:
library(data.table)
test <- data.table(index = rep(letters[1:10], 100), var1 = rnorm(1000, 0, 1))
What I would like to do is add columns a through j as dummy variables, such that column a has the value 1 when index == "a" and 0 otherwise. With a data.frame it would look something like:
test$a <- 0
test$a[test$index=='a'] <- 1
This seems to do what you're looking for:
inds <- unique(test$index)
test[, (inds) := lapply(inds, function(x) index == x)]
which gives
index var1 a b c d e f g h i j
1: a 0.25331851 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2: b -0.02854676 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3: c -0.04287046 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4: d 1.36860228 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
5: e -0.22577099 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
---
996: f -1.02040059 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
997: g -1.31345092 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
998: h -0.49448088 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
999: i 1.75175715 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
1000: j 0.05576477 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
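The columns above are logical, while the question asks for 1 and 0. A minimal sketch of one way to get integer dummies, assuming the same test table as in the question (the set.seed call is my addition, only there to make the example reproducible). It also shows the data.table counterpart of the two-line data.frame idiom from the question:

```r
library(data.table)

set.seed(1)  # assumption: seed added only for reproducibility
test <- data.table(index = rep(letters[1:10], 100), var1 = rnorm(1000, 0, 1))

# Single column: start at 0, then overwrite the matching rows by reference.
# The condition goes in i, the assignment goes in j.
test[, a := 0L]
test[index == "a", a := 1L]

# All levels at once, stored as integer 0/1 instead of TRUE/FALSE.
inds <- unique(test$index)
test[, (inds) := lapply(inds, function(x) as.integer(index == x))]
```

Wrapping the comparison in as.integer() is the only change from the answer above; if logical columns are acceptable for your model, the original form is fine as it is.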
Here's another way:
dcast(test, index + var1 ~ index, fun.aggregate = length)
# or, if you want to preserve row order (note: this also adds column r to test itself, by reference)
dcast(test[, r := .I], r + index + var1 ~ index, fun.aggregate = length)[, r := NULL]
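To see why counting with length produces dummies, here is a small sketch on a three-row table. The table small and its values are made up for illustration, and value.var is set explicitly only to avoid the message dcast prints when it has to guess. Each row of the left-hand side matches exactly one index level, so the count is 1 in that column; combinations that do not occur are filled with 0.

```r
library(data.table)

# Hypothetical toy data, not from the question
small <- data.table(index = c("a", "b", "a"), var1 = c(0.1, 0.2, 0.3))

small[, r := .I]  # row id keeps every row distinct and preserves order
wide <- dcast(small, r + index + var1 ~ index,
              fun.aggregate = length, value.var = "var1")
wide[, r := NULL]  # drop the helper column from the result
print(wide)
```

Unlike the := approaches, dcast returns a new table rather than modifying the original in place, which may matter for memory on a very large dataset.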
And another:
rs = split(seq_len(nrow(test)), test$index)
test[, names(rs) := FALSE]
for (n in names(rs)) set(test, i = rs[[n]], j = n, value = TRUE)