I am trying to recode a variable using data.table. I have googled for almost 2 hours but couldn't find an answer.
Assume I have a data.table as the following:
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
I want to recode V1 and V2. For V1, I want to recode 1s to 0 and 2s to 1. For V2, I want to recode A to T, B to K, C to D.
If I use dplyr
, it is simple.
library(dplyr)
DT %>%
mutate(V1 = recode(V1, `1` = 0L, `2` = 1L)) %>%
mutate(V2 = recode(V2, A = "T", B = "K", C = "D"))
But I have no idea how to do this in data.table
DT[V1==1, V1 := 0]
DT[V1==2, V1 := 1]
DT[V2=="A", V2 := "T"]
DT[V2=="B", V2 := "K"]
DT[V2=="C", V2 := "D"]
Above is the code that I can think as my best. But there must be a better and a more efficient way to do this.
Edit
I changed how I want to recode V2 to make my example more general.
With data.table
the recode can be solved with an update on join:
DT[.(V1 = 1:2, to = 0:1), on = "V1", V1 := i.to]
DT[.(V2 = LETTERS[1:3], to = c("T", "K", "D")), on = "V2", V2 := i.to]
which converts DT
to
V1 V2 V4
1: 0 T 1
2: 0 K 2
3: 1 D 3
4: 0 T 4
5: 0 K 5
6: 1 D 6
7: 0 T 7
8: 0 K 8
9: 1 D 9
10: 0 T 10
11: 0 K 11
12: 1 D 12
Edit: @Frank suggested to use i.to
to be on the safe side.
The expressions .(V1 = 1:2, to = 0:1)
and .(V2 = LETTERS[1:3], to = c("T", "K", "D"))
, resp., create lookup tables on-the-fly.
Alternatively, the lookup tables can be set-up beforehand
lut1 <- data.table(V1 = 1:2, to = 0:1)
lut2 <- data.table(V2 = LETTERS[1:3], to = c("T", "K", "D"))
lut1
V1 to 1: 1 0 2: 2 1
lut2
V2 to 1: A T 2: B K 3: C D
Then, the update joins become
DT[lut1, on = "V1", V1 := i.to]
DT[lut2, on = "V2", V2 := i.to]
mat asked "How can I use this code dynamically?"
So, here is a modified version where the name of column to update is provided as a character variable my_var_name
but the lookup tables still are created on-the-fly:
my_var_name <- "V1"
DT[.(from = 1:2, to = 0:1), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
my_var_name <- "V2"
DT[.(from = LETTERS[1:3], to = c("T", "K", "D")), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
There are 3 points to note:
from
. This requires a join between differently named columns (foreign key join). The names of the columns to join on have to be specified via the on
parameter.on
parameter accepts character strings for foreign key joins of the form "V1==from"
. This string is created dynamically using paste0()
.(my_var_name) := i.to
, the parentheses around the variable my_var_name
forces to use the contents of my_var_name
.Now, while the column to recode is specified dynamically by a variable, the lookup tables to use are still hard-coded in the statement which means we have stopped halfways: We need also to select the appropriate lookup table dynamically.
This can be achieved by storing the lookup tables in a list where each list element is named according to the column of DT
it is supposed to recode:
lut_list <- list(
V1 = data.table(from = 1:2, to = 0:1),
V2 = data.table(from = LETTERS[1:3], to = c("T", "K", "D"))
)
lut_list
$V1 from to <int> <int> 1: 1 0 2: 2 1 $V2 from to <char> <char> 1: A T 2: B K 3: C D
Now, we can pick the appropriate lookup table from the list dynamically as well:
my_var_name <- "V1"
DT[lut_list[[my_var_name]], on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
Going one step further, we can recode all relevant columns of DT
in a loop:
for (v in intersect(names(lut_list), colnames(DT))) {
DT[lut_list[[v]], on = paste0(v, "==from"), (v) := i.to]
}
Note that DT
is updated by reference, i.e., only the affected elements are replaced in place without copying the whole object. So, the for
loop is applied iteratively on the same data object. This is a speciality of data.table and will not work with data.frames or tibbles.
mapvalues() from plyr, in combination with data.table, works really well.
I use it on large-ish data (50 mio - 400 mio rows). Although I haven't benchmarked it as compared to other possibilities, I find the clear syntax is worth a lot, as it means fewer errors in complicated recode operations.
library(data.table)
library(plyr)
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
DT[, V1 := mapvalues(V1, from=c(1, 2), to=c(0, 1))]
DT[, V2 := mapvalues(V2, from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))]
For more complicated recode operations, I would always create a new variable first with NA, and use another data.table with from-to vectors as variables.
A feature that in some use-cases is more of a bug is that mapvalues() keeps those values from the old variable that isn't in the from argument. This is a problem if you're sure that all the correct values is in the from-vector, so that any values in the data.table that isn't in this vector should be NA instead.
DT <- data.table(V1=c(LETTERS[1:3], 'i dont want this value transfered'),
V4=1:12)
map_DT <- data.table(from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))
# NA variable to begin with is good practice because it is clearer to spot an error
DT[, V1_new := NA_character_]
DT[V1 %in% map_DT$from , V1_new := mapvalues(V1, from=map_DT$from, to=map_DT$to)][]
note that plyr is deprecated, so the mapvalues-function is somewhat at risk of disappearing at some point in the future. the update-joins method proposed might be a better method because of this, although I find mapvalues to be just a tad clearer to read. although it will probably take years before mapvalues is deprecated, most likely, a lot of years. But still, something to keep in mind when deciding to use it as a tool or not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With