Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recode a variable using data.table

I am trying to recode a variable using data.table. I have googled for almost 2 hours but couldn't find an answer.

Assume I have a data.table as the following:

DT <- data.table(V1=c(0L,1L,2L),
                 V2=LETTERS[1:3],
                 V4=1:12)

I want to recode V1 and V2. For V1, I want to recode 1s to 0 and 2s to 1. For V2, I want to recode A to T, B to K, C to D.

If I use dplyr, it is simple.

library(dplyr)
DT %>% 
  mutate(V1 = recode(V1, `1` = 0L, `2` = 1L)) %>% 
  mutate(V2 = recode(V2, A = "T", B = "K", C = "D"))

But I have no idea how to do this in data.table

DT[V1==1, V1 := 0]
DT[V1==2, V1 := 1]
DT[V2=="A", V2 := "T"]
DT[V2=="B", V2 := "K"]
DT[V2=="C", V2 := "D"]

Above is the code that I can think as my best. But there must be a better and a more efficient way to do this.


Edit

I changed how I want to recode V2 to make my example more general.

like image 507
user3077008 Avatar asked Jun 16 '17 13:06

user3077008


2 Answers

With data.table the recode can be solved with an update on join:

DT[.(V1 = 1:2, to = 0:1), on = "V1", V1 := i.to]
DT[.(V2 = LETTERS[1:3], to = c("T", "K", "D")), on = "V2", V2 := i.to]

which converts DT to

    V1 V2 V4
 1:  0  T  1
 2:  0  K  2
 3:  1  D  3
 4:  0  T  4
 5:  0  K  5
 6:  1  D  6
 7:  0  T  7
 8:  0  K  8
 9:  1  D  9
10:  0  T 10
11:  0  K 11
12:  1  D 12

Edit: @Frank suggested to use i.to to be on the safe side.

Explanation

The expressions .(V1 = 1:2, to = 0:1) and .(V2 = LETTERS[1:3], to = c("T", "K", "D")), resp., create lookup tables on-the-fly.

Alternatively, the lookup tables can be set-up beforehand

lut1 <- data.table(V1 = 1:2, to = 0:1)
lut2 <- data.table(V2 = LETTERS[1:3], to = c("T", "K", "D"))

lut1
   V1 to
1:  1  0
2:  2  1
lut2
   V2 to
1:  A  T
2:  B  K
3:  C  D

Then, the update joins become

DT[lut1, on = "V1", V1 := i.to]
DT[lut2, on = "V2", V2 := i.to]

Edit 2: Answers to How can I use this code dynamically?

mat asked "How can I use this code dynamically?"

So, here is a modified version where the name of column to update is provided as a character variable my_var_name but the lookup tables still are created on-the-fly:

my_var_name <- "V1"
DT[.(from = 1:2, to = 0:1), on = paste0(my_var_name, "==from"), 
   (my_var_name) := i.to]
my_var_name <- "V2"
DT[.(from = LETTERS[1:3], to = c("T", "K", "D")), on = paste0(my_var_name, "==from"), 
   (my_var_name) := i.to]

There are 3 points to note:

  1. Instead of naming the first column of the lookup table dynamically it gets a fixed name from. This requires a join between differently named columns (foreign key join). The names of the columns to join on have to be specified via the on parameter.
  2. The on parameter accepts character strings for foreign key joins of the form "V1==from". This string is created dynamically using paste0().
  3. In the expression (my_var_name) := i.to, the parentheses around the variable my_var_name forces to use the contents of my_var_name.

Dynamic code using pre-defined lookup tables

Now, while the column to recode is specified dynamically by a variable, the lookup tables to use are still hard-coded in the statement which means we have stopped halfways: We need also to select the appropriate lookup table dynamically.

This can be achieved by storing the lookup tables in a list where each list element is named according to the column of DT it is supposed to recode:

 lut_list  <- list(
  V1 = data.table(from = 1:2, to = 0:1),
  V2 = data.table(from = LETTERS[1:3], to = c("T", "K", "D"))
)
lut_list
$V1
    from    to
   <int> <int>
1:     1     0
2:     2     1

$V2
     from     to
   <char> <char>
1:      A      T
2:      B      K
3:      C      D

Now, we can pick the appropriate lookup table from the list dynamically as well:

my_var_name <- "V1"
DT[lut_list[[my_var_name]], on = paste0(my_var_name, "==from"), 
   (my_var_name) := i.to]

Going one step further, we can recode all relevant columns of DT in a loop:

for (v in intersect(names(lut_list), colnames(DT))) {
  DT[lut_list[[v]], on = paste0(v, "==from"), (v) := i.to]
}

Note that DT is updated by reference, i.e., only the affected elements are replaced in place without copying the whole object. So, the for loop is applied iteratively on the same data object. This is a speciality of data.table and will not work with data.frames or tibbles.

like image 96
Uwe Avatar answered Sep 21 '22 21:09

Uwe


mapvalues() from plyr, in combination with data.table, works really well.

I use it on large-ish data (50 mio - 400 mio rows). Although I haven't benchmarked it as compared to other possibilities, I find the clear syntax is worth a lot, as it means fewer errors in complicated recode operations.

library(data.table)
library(plyr)


DT <- data.table(V1=c(0L,1L,2L),
                 V2=LETTERS[1:3],
                 V4=1:12)


DT[, V1 := mapvalues(V1, from=c(1, 2), to=c(0, 1))]
DT[, V2 := mapvalues(V2, from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))]

For more complicated recode operations, I would always create a new variable first with NA, and use another data.table with from-to vectors as variables.

A feature that in some use-cases is more of a bug is that mapvalues() keeps those values from the old variable that isn't in the from argument. This is a problem if you're sure that all the correct values is in the from-vector, so that any values in the data.table that isn't in this vector should be NA instead.

DT <- data.table(V1=c(LETTERS[1:3], 'i dont want this value transfered'),
                 V4=1:12)
map_DT <- data.table(from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))


# NA variable to begin with is good practice because it is clearer to spot an error  
DT[, V1_new := NA_character_]
DT[V1 %in% map_DT$from , V1_new := mapvalues(V1, from=map_DT$from, to=map_DT$to)][]

note that plyr is deprecated, so the mapvalues-function is somewhat at risk of disappearing at some point in the future. the update-joins method proposed might be a better method because of this, although I find mapvalues to be just a tad clearer to read. although it will probably take years before mapvalues is deprecated, most likely, a lot of years. But still, something to keep in mind when deciding to use it as a tool or not.

like image 32
emilBeBri Avatar answered Sep 22 '22 21:09

emilBeBri