Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data table: update join

Tags:

r

data.table

Suppose I have two data tables:

X <- data.table(id = 1:5, L = letters[1:5])

   id L
1:  1 a
2:  2 b
3:  3 c
4:  4 d
5:  5 e

Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

   id  L  N
1:  3 NA 10
2:  4  g NA
3:  5  h 12

Would it be possible to do a left outer join of X and Y by id using data table built-in functions? If not, I would like build a function (e.g. leftOuterJoin) with the following expected output:

leftOuterJoin(X, Y, on = "id")

   id  L  N
1:  1  a NA
2:  2  b NA
3:  3 NA 10
4:  4  g NA
5:  5  h 12

I have tried without success:

X[Y, on = "id"]

   id L i.L  N
1:  3 c  NA 10
2:  4 d   g NA
3:  5 e   h 12

I have also tried this, which is almost what I am looking for:

setkey(X, id)
setkey(Y, id)
merge(X, Y, all.x = TRUE)

   id L.x L.y  N
1:  1   a  NA NA
2:  2   b  NA NA
3:  3   c  NA 10
4:  4   d   g NA
5:  5   e   h 12
like image 432
mat Avatar asked Jun 08 '17 10:06

mat


People also ask

How do you join data tables in R?

If you want to join by multiple variables, then you need to specify a vector of variable names: by = c("var1", "var2", "var3") . Here all three columns must match in both tables. If you want to use all variables that appear in both tables, then you can leave the by argument blank.

How do you do a left join in a data table?

If you want to do a left join, you can use all. x = TRUE . If you want to do a full outer join, you can use all = TRUE .

How do you inner join in R?

How to do an inner join on data frames in R? To perform inner join use either merge() function, dplyr inner_join() function, or use reduce() from tidyverse. Using the dplyr function is the best approach as it runs faster than the R base approach. dplyr package provides several functions to join data frames in R.

When I is a data table or character vector the columns to join by must be specified using?

table (or character vector), the columns to join by must be specified using 'on=' argument (see ? data. table), by keying x (i.e. sorted, and, marked as sorted, see ? setkey), or by sharing column names between x and i (i.e., a natural join).


2 Answers

I may have miss few things, please correct me if there is a better solution. I usually like to write function for such things.

Here one: the goal is to have all possibilities available. Join and update as well the join variable, use other variable names ...

> update.DT <- function(DATA1, DATA2, join.variable, overwrite.variable, overwrite.with.variable) {
+       
+       DATA1[DATA2, c(overwrite.variable) := mget(p0("i.", overwrite.with.variable)), on = join.variable][]
+       
+     }
> X <- X2 <- X3 <- data.table(id = 1:5, L = letters[1:5], PS = rep(59, 5))
> Y <- data.table(id = 3:5, id2 = 11:13, L = c("z", "g", "h"), PS = rep(61, 3))
> X
   id L PS
1:  1 a 59
2:  2 b 59
3:  3 c 59
4:  4 d 59
5:  5 e 59
> Y
   id id2 L PS
1:  3  11 z 61
2:  4  12 g 61
3:  5  13 h 61
> update.DT(DATA1 = X, DATA2 = Y, join.variable = "id", overwrite.variable = c("L"), overwrite.with.variable = c("L"))
   id L PS
1:  1 a 59
2:  2 b 59
3:  3 z 59
4:  4 g 59
5:  5 h 59
> update.DT(DATA1 = X2, DATA2 = Y, join.variable = "id", overwrite.variable = c("L", "PS"), overwrite.with.variable = c("L", "PS"))
   id L PS
1:  1 a 59
2:  2 b 59
3:  3 z 61
4:  4 g 61
5:  5 h 61
> update.DT(DATA1 = X2, DATA2 = Y, join.variable = "id", overwrite.variable = c("L", "PS", "id"), overwrite.with.variable = c("L", "PS", "id2"))
   id L PS
1:  1 a 59
2:  2 b 59
3: 11 z 61
4: 12 g 61
5: 13 h 61
like image 28
Dorian Grv Avatar answered Oct 14 '22 16:10

Dorian Grv


It is an update join:

library(data.table)
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
X[Y, on=.(id), c("L", "N"):=.(i.L, i.N)][]
#    id  L  N
# 1:  1  a NA
# 2:  2  b NA
# 3:  3 NA 10
# 4:  4  g NA
# 5:  5  h 12

gives you the desired result.
Here I found a solution for multiple columns:

library(data.table)
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

X[Y, on=.(id), names(Y)[-1]:=mget(paste0("i.", names(Y)[-1]))]

Another variant:

n <- names(Y)
X[Y, on=.(id), (n):=mget(paste0("i.", n))]
like image 108
jogo Avatar answered Oct 14 '22 17:10

jogo