
Join data.tables using column names stored in variables

Tags: join, r, data.table

I have 2 data.tables:

library(data.table)
dt1 <- data.table(id = 1:5, value1 = 11:15, value2 = 21:25, value3 = 36:40)
dt2 <- data.table(name = c("value1", "value1", "value1", "value1", 
                            "value2", "value2", "value2", "value3", "value3"), 
              valueMin = c(10, 13, 14, 18, 21, 24, 25, 36, 38), 
              valueMax = c(13, 14, 18, 20, 24, 25, 27, 38, 42), 
              label = c(101:104, 201:203, 301:302))
> dt1
   id value1 value2 value3
1:  1     11     21     36
2:  2     12     22     37
3:  3     13     23     38
4:  4     14     24     39
5:  5     15     25     40
> dt2
     name valueMin valueMax label
1: value1       10       13   101
2: value1       13       14   102
3: value1       14       18   103
4: value1       18       20   104
5: value2       21       24   201
6: value2       24       25   202
7: value2       25       27   203
8: value3       36       38   301
9: value3       38       42   302

The result I expect is the following: label from dt2 joined onto dt1 wherever value1 in dt1 lies between valueMin (exclusive) and valueMax (inclusive) in dt2 and dt2$name equals "value1". Here is a solution I have (it gives the correct result):

library(dplyr)  # provides %>% and select()

varName <- "value1"
dt2_temp <- dt2[name == varName,]
dt1[dt2_temp, on = .(value1 > valueMin, value1 <= valueMax), nomatch = 0] %>%
  select(id, label)
   id label
1:  1   101
2:  2   101
3:  3   101
4:  4   102
5:  5   103

I would like to do the same (get the label columns) for all the remaining columns (value2, value3) in dt1 using a loop, so I need to replace the reference to the column name value1 in the join with the name stored in varName, something like:

dt1[dt2_temp, on = .(varName > valueMin, varName <= valueMax), nomatch = 0]

Unfortunately, none of the following worked: plain varName, eval(varName), as.name(varName). Do you have an idea how to solve this?

Error message is similar to:

Error in `[.data.table`(dt1, dt2_temp, on = .(varName > valueMin, varName <= valueMax),  : 
  Column(s) [varName,varName] not found in x
Jekaterina Borodina asked Jul 05 '18 08:07



4 Answers

Here is another method, which programmatically constructs the on string (see the on argument in ?data.table):

dt1[dt2_temp, 
    on=c(paste0(varName, ">valueMin"), paste0(varName, "<=valueMax")),
    nomatch=0L]

Note that there should not be any space around the variable names.
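
Since the question asks for a loop over the remaining columns, the same on-string construction can be wrapped in a function and applied to each column. The following is only a sketch and not part of the original answer: labels_for, value_cols, labels_long and labels_wide are hypothetical names, and it assumes that every column of dt1 other than id should be looked up in dt2.

```r
library(data.table)

dt1 <- data.table(id = 1:5, value1 = 11:15, value2 = 21:25, value3 = 36:40)
dt2 <- data.table(name = c(rep("value1", 4), rep("value2", 3), rep("value3", 2)),
                  valueMin = c(10, 13, 14, 18, 21, 24, 25, 36, 38),
                  valueMax = c(13, 14, 18, 20, 24, 25, 27, 38, 42),
                  label = c(101:104, 201:203, 301:302))

# Hypothetical helper: non-equi join of dt1 against the dt2 rows for one column
labels_for <- function(varName) {
  dt2_temp <- dt2[name == varName]
  res <- dt1[dt2_temp,
             on = c(paste0(varName, ">valueMin"), paste0(varName, "<=valueMax")),
             nomatch = 0L]
  res[, .(id, label)]
}

# Assumption: every column except id is a value column to look up
value_cols <- setdiff(names(dt1), "id")

# One result per column, stacked; the list names become the "variable" column
labels_long <- rbindlist(lapply(setNames(value_cols, value_cols), labels_for),
                         idcol = "variable")

# Optional: one label column per value column; ids with no match get NA
labels_wide <- dcast(labels_long, id ~ variable, value.var = "label")
```

The long result keeps only rows that matched an interval, so an id can be missing for a given column; the wide form makes those gaps visible as NA.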

chinsoon12 answered Oct 12 '22 08:10



Why not do it all in one go without a loop?

A possible solution:

melt(dt1, id = 1)[dt2, on = .(variable = name, value > valueMin, value <= valueMax), lbl := i.label
                  ][, dcast(.SD, id ~ variable, value.var = c("value","lbl"))]

which gives:

   id value_value1 value_value2 value_value3 lbl_value1 lbl_value2 lbl_value3
1:  1           11           21           36        101         NA         NA
2:  2           12           22           37        101        201        301
3:  3           13           23           38        101        201        301
4:  4           14           24           39        102        201        302
5:  5           15           25           40        103        202        302
Jaap answered Oct 12 '22 09:10



melt(dt1,1)[dt2, on = .(value> valueMin, value <= valueMax,variable=name), nomatch = 0]

   id variable value value.1 label
 1:  1   value1    10      13   101
 2:  2   value1    10      13   101
 3:  3   value1    10      13   101
 4:  4   value1    13      14   102
 5:  5   value1    14      18   103
 6:  2   value2    21      24   201
 7:  3   value2    21      24   201
 8:  4   value2    21      24   201
 9:  5   value2    24      25   202
10:  2   value3    36      38   301
11:  3   value3    36      38   301
12:  4   value3    38      42   302
13:  5   value3    38      42   302
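
Note that in the output above the value and value.1 columns hold the bounds from dt2 (valueMin and valueMax), not the original values from dt1. If the original values are wanted, one option is to select them explicitly in j using the x. prefix. This is a sketch rather than part of the original answer; long and res are illustrative names, and variable.factor = FALSE is used only to keep variable as a character column.

```r
library(data.table)

dt1 <- data.table(id = 1:5, value1 = 11:15, value2 = 21:25, value3 = 36:40)
dt2 <- data.table(name = c(rep("value1", 4), rep("value2", 3), rep("value3", 2)),
                  valueMin = c(10, 13, 14, 18, 21, 24, 25, 36, 38),
                  valueMax = c(13, 14, 18, 20, 24, 25, 27, 38, 42),
                  label = c(101:104, 201:203, 301:302))

# Long form of dt1: one row per (id, variable) pair
long <- melt(dt1, id.vars = "id", variable.factor = FALSE)

# x.value refers to the value column of `long` itself, not the joined bounds
res <- long[dt2,
            on = .(variable = name, value > valueMin, value <= valueMax),
            .(id, variable, value = x.value, label),
            nomatch = 0L]
```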
KU99 answered Oct 12 '22 07:10



One approach could be:

library(data.table)

dcast(dt2[melt(dt1, id.vars = 1),    #left join of long form of dt1 and original dt2
          .( id, variable, label),   #only keep concerned columns from merged table
          on = .(name = variable,  valueMax >= value, valueMin < value)],  #join conditions
      id ~ variable, 
      value.var = "label")           #long to wide format using dcast to get the final result

which gives:

   id value1 value2 value3
1:  1    101     NA     NA
2:  2    101    201    301
3:  3    101    201    301
4:  4    102    201    302
5:  5    103    202    302
1.618 answered Oct 12 '22 09:10
