Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to use the data.table index-join-assignment idiom to do a left join and assign NAs in the non-matching rows of i to x?

Yesterday I gave this answer: Matching Data Tables by five columns to change a value in another column.

In the comments, the OP asked if we could effectively achieve a left join of the two tables and thereby get the NAs that would result in the right table to be assigned to the left table. It seems to me that data.table does not provide any means of doing this.

Here's the example case I used in that question:

set.seed(1L);
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
dt1;
##     id V1 V2      blah1
##  1:  1  1  1 -0.6264538
##  2:  2  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4:  4  1  2  1.5952808
##  5:  5  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7:  7  1  3  0.4874291
##  8:  8  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432
dt2;
##    id V1 V2       blah2
## 1: 13  1  1 -0.62124058
## 2: 14  2  1 -2.21469989
## 3: 15  1  2  1.12493092
## 4: 16  2  2 -0.04493361
## 5: 17  1  3 -0.01619026
## 6: 18  2  3  0.94383621
key <- paste0('V',1:2);

And here's the solution I gave which does not get NAs for non-matching rows:

dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432

What we need is for the id values 12 and under that remain in dt1 to be replaced with NAs (not because they are 12 and under, and not because those id values are missing from dt2, but because the join on the key columns, namely V1 and V2, does not result in a match for those rows in dt1 against dt2).

As I said in the comments of that question, a workaround is to preassign dt1$id to all NAs, and then run the index-join-assignment. Hence, this is the expected output:

dt1$id <- NA;
dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3: NA  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6: NA  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9: NA  3  3  0.5757814
## 10: NA  1  4 -0.3053884
## 11: NA  2  4  1.5117812
## 12: NA  3  4  0.3898432

I think the workaround is ok, but I'm not sure why data.table doesn't seem to be capable of this functionality in one shot with an index-join-assign operation. The following are three dead-ends I explored:

1: nomatch

data.table provides a nomatch argument which looks a little bit like the all, all.x, and all.y arguments of merge(). This is actually a very limited argument; it only allows changing from a right join (nomatch=NA, the default) to an inner join (nomatch=0). We cannot achieve a left join with it.

2: flip dt1 and dt2

Since dt1[dt2] is a right join, we can just flip it, meaning dt2[dt1], to achieve the corresponding left join.

This will not work either because we need to use the := in-place assignment syntax in the j argument to assign into dt1, and under the flipped call, we'll instead be assigning to dt2. I tried assigning to i.id under the flipped command, but it didn't affect the original dt1.

3: use merge.data.table()

We can call merge.data.table() with the all.x=T argument to achieve a left join. The problem now is that merge.data.table() has no j argument, and it simply provides no means of assigning in-place a column of the left (or right) table.


So, is it possible to perform this operation at all using data.table? And if so, what's the best way to do it?

like image 963
bgoldst Avatar asked Aug 28 '16 19:08

bgoldst


1 Answers

AFAIU you just want to lookup id column from dt2 to dt1. Original id variable in dt1 seems to be unrelated to whole process as you join on V1,V2 and you don't want to have dt1$id values in the result. So technically correct way to address it is to not use that column at all.

set.seed(1)
library(data.table)
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
on = paste0("V",1:2) # I rename to `on` to not mask `key` function
dt1[,id:=NULL
    ][dt2,on=on,id:=i.id
      ][]
#  V1 V2      blah1 id
# 1:  1  1 -0.6264538 13
# 2:  2  1  0.1836433 14
# 3:  3  1 -0.8356286 NA
# 4:  1  2  1.5952808 15
# 5:  2  2  0.3295078 16
# 6:  3  2 -0.8204684 NA
# 7:  1  3  0.4874291 17
# 8:  2  3  0.7383247 18
# 9:  3  3  0.5757814 NA
#10:  1  4 -0.3053884 NA
#11:  2  4  1.5117812 NA
#12:  3  4  0.3898432 NA

Aside from the question...
- you don't have to use ; at the end of line if there is only single expression to evaluate
- use dt1[, id := NA_integer_] instead of dt1$id <- NA
- use set.seed when providing code with rnorm and other randomness related calls

like image 197
jangorecki Avatar answered Sep 19 '22 08:09

jangorecki