Yesterday I gave this answer: Matching Data Tables by five columns to change a value in another column.
In the comments, the OP asked if we could effectively achieve a left join of the two tables and thereby get the NAs that would result in the right table to be assigned to the left table. It seems to me that data.table does not provide any means of doing this.
Here's the example case I used in that question:
set.seed(1L);
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
dt1;
## id V1 V2 blah1
## 1: 1 1 1 -0.6264538
## 2: 2 2 1 0.1836433
## 3: 3 3 1 -0.8356286
## 4: 4 1 2 1.5952808
## 5: 5 2 2 0.3295078
## 6: 6 3 2 -0.8204684
## 7: 7 1 3 0.4874291
## 8: 8 2 3 0.7383247
## 9: 9 3 3 0.5757814
## 10: 10 1 4 -0.3053884
## 11: 11 2 4 1.5117812
## 12: 12 3 4 0.3898432
dt2;
## id V1 V2 blah2
## 1: 13 1 1 -0.62124058
## 2: 14 2 1 -2.21469989
## 3: 15 1 2 1.12493092
## 4: 16 2 2 -0.04493361
## 5: 17 1 3 -0.01619026
## 6: 18 2 3 0.94383621
key <- paste0('V',1:2);
And here's the solution I gave which does not get NAs for non-matching rows:
dt1[dt2,on=key,id:=i.id];
dt1;
## id V1 V2 blah1
## 1: 13 1 1 -0.6264538
## 2: 14 2 1 0.1836433
## 3: 3 3 1 -0.8356286
## 4: 15 1 2 1.5952808
## 5: 16 2 2 0.3295078
## 6: 6 3 2 -0.8204684
## 7: 17 1 3 0.4874291
## 8: 18 2 3 0.7383247
## 9: 9 3 3 0.5757814
## 10: 10 1 4 -0.3053884
## 11: 11 2 4 1.5117812
## 12: 12 3 4 0.3898432
What we need is for the id
values 12 and under that remain in dt1
to be replaced with NAs (not because they are 12 and under, and not because those id values are missing from dt2
, but because the join on the key
columns, namely V1
and V2
, does not result in a match for those rows in dt1
against dt2
).
As I said in the comments of that question, a workaround is to preassign dt1$id
to all NAs, and then run the index-join-assignment. Hence, this is the expected output:
dt1$id <- NA;
dt1[dt2,on=key,id:=i.id];
dt1;
## id V1 V2 blah1
## 1: 13 1 1 -0.6264538
## 2: 14 2 1 0.1836433
## 3: NA 3 1 -0.8356286
## 4: 15 1 2 1.5952808
## 5: 16 2 2 0.3295078
## 6: NA 3 2 -0.8204684
## 7: 17 1 3 0.4874291
## 8: 18 2 3 0.7383247
## 9: NA 3 3 0.5757814
## 10: NA 1 4 -0.3053884
## 11: NA 2 4 1.5117812
## 12: NA 3 4 0.3898432
I think the workaround is ok, but I'm not sure why data.table doesn't seem to be capable of this functionality in one shot with an index-join-assign operation. The following are three dead-ends I explored:
1: nomatch
data.table provides a nomatch
argument which looks a little bit like the all
, all.x
, and all.y
arguments of merge()
. This is actually a very limited argument; it only allows changing from a right join (nomatch=NA
, the default) to an inner join (nomatch=0
). We cannot achieve a left join with it.
2: flip dt1
and dt2
Since dt1[dt2]
is a right join, we can just flip it, meaning dt2[dt1]
, to achieve the corresponding left join.
This will not work either because we need to use the :=
in-place assignment syntax in the j
argument to assign into dt1
, and under the flipped call, we'll instead be assigning to dt2
. I tried assigning to i.id
under the flipped command, but it didn't affect the original dt1
.
3: use merge.data.table()
We can call merge.data.table()
with the all.x=T
argument to achieve a left join. The problem now is that merge.data.table()
has no j
argument, and it simply provides no means of assigning in-place a column of the left (or right) table.
So, is it possible to perform this operation at all using data.table? And if so, what's the best way to do it?
AFAIU you just want to lookup id
column from dt2
to dt1
. Original id
variable in dt1
seems to be unrelated to whole process as you join on V1,V2
and you don't want to have dt1$id
values in the result. So technically correct way to address it is to not use that column at all.
set.seed(1)
library(data.table)
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
on = paste0("V",1:2) # I rename to `on` to not mask `key` function
dt1[,id:=NULL
][dt2,on=on,id:=i.id
][]
# V1 V2 blah1 id
# 1: 1 1 -0.6264538 13
# 2: 2 1 0.1836433 14
# 3: 3 1 -0.8356286 NA
# 4: 1 2 1.5952808 15
# 5: 2 2 0.3295078 16
# 6: 3 2 -0.8204684 NA
# 7: 1 3 0.4874291 17
# 8: 2 3 0.7383247 18
# 9: 3 3 0.5757814 NA
#10: 1 4 -0.3053884 NA
#11: 2 4 1.5117812 NA
#12: 3 4 0.3898432 NA
Aside from the question...
- you don't have to use ;
at the end of line if there is only single expression to evaluate
- use dt1[, id := NA_integer_]
instead of dt1$id <- NA
- use set.seed
when providing code with rnorm
and other randomness related calls
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With