Suppose I have two data tables:
X <- data.table(id = 1:5, L = letters[1:5])
id L
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
id L N
1: 3 NA 10
2: 4 g NA
3: 5 h 12
Would it be possible to do a left outer join of X
and Y
by id
using data table built-in functions? If not, I would like build a function (e.g. leftOuterJoin
) with the following expected output:
leftOuterJoin(X, Y, on = "id")
id L N
1: 1 a NA
2: 2 b NA
3: 3 NA 10
4: 4 g NA
5: 5 h 12
I have tried without success:
X[Y, on = "id"]
id L i.L N
1: 3 c NA 10
2: 4 d g NA
3: 5 e h 12
I have also tried this, which is almost what I am looking for:
setkey(X, id)
setkey(Y, id)
merge(X, Y, all.x = TRUE)
id L.x L.y N
1: 1 a NA NA
2: 2 b NA NA
3: 3 c NA 10
4: 4 d g NA
5: 5 e h 12
If you want to join by multiple variables, then you need to specify a vector of variable names: by = c("var1", "var2", "var3") . Here all three columns must match in both tables. If you want to use all variables that appear in both tables, then you can leave the by argument blank.
If you want to do a left join, you can use all. x = TRUE . If you want to do a full outer join, you can use all = TRUE .
How to do an inner join on data frames in R? To perform inner join use either merge() function, dplyr inner_join() function, or use reduce() from tidyverse. Using the dplyr function is the best approach as it runs faster than the R base approach. dplyr package provides several functions to join data frames in R.
table (or character vector), the columns to join by must be specified using 'on=' argument (see ? data. table), by keying x (i.e. sorted, and, marked as sorted, see ? setkey), or by sharing column names between x and i (i.e., a natural join).
I may have miss few things, please correct me if there is a better solution. I usually like to write function for such things.
Here one: the goal is to have all possibilities available. Join and update as well the join variable, use other variable names ...
> update.DT <- function(DATA1, DATA2, join.variable, overwrite.variable, overwrite.with.variable) {
+
+ DATA1[DATA2, c(overwrite.variable) := mget(p0("i.", overwrite.with.variable)), on = join.variable][]
+
+ }
> X <- X2 <- X3 <- data.table(id = 1:5, L = letters[1:5], PS = rep(59, 5))
> Y <- data.table(id = 3:5, id2 = 11:13, L = c("z", "g", "h"), PS = rep(61, 3))
> X
id L PS
1: 1 a 59
2: 2 b 59
3: 3 c 59
4: 4 d 59
5: 5 e 59
> Y
id id2 L PS
1: 3 11 z 61
2: 4 12 g 61
3: 5 13 h 61
> update.DT(DATA1 = X, DATA2 = Y, join.variable = "id", overwrite.variable = c("L"), overwrite.with.variable = c("L"))
id L PS
1: 1 a 59
2: 2 b 59
3: 3 z 59
4: 4 g 59
5: 5 h 59
> update.DT(DATA1 = X2, DATA2 = Y, join.variable = "id", overwrite.variable = c("L", "PS"), overwrite.with.variable = c("L", "PS"))
id L PS
1: 1 a 59
2: 2 b 59
3: 3 z 61
4: 4 g 61
5: 5 h 61
> update.DT(DATA1 = X2, DATA2 = Y, join.variable = "id", overwrite.variable = c("L", "PS", "id"), overwrite.with.variable = c("L", "PS", "id2"))
id L PS
1: 1 a 59
2: 2 b 59
3: 11 z 61
4: 12 g 61
5: 13 h 61
It is an update join:
library(data.table)
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
X[Y, on=.(id), c("L", "N"):=.(i.L, i.N)][]
# id L N
# 1: 1 a NA
# 2: 2 b NA
# 3: 3 NA 10
# 4: 4 g NA
# 5: 5 h 12
gives you the desired result.
Here I found a solution for multiple columns:
library(data.table)
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
X[Y, on=.(id), names(Y)[-1]:=mget(paste0("i.", names(Y)[-1]))]
Another variant:
n <- names(Y)
X[Y, on=.(id), (n):=mget(paste0("i.", n))]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With