Perform a semi-join with data.table

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:

    library(data.table)

    x <- data.table(x = 1:2, y = c("a", "b"))
    setkey(x, x)
    y <- data.table(x = c(1, 1), z = 10:11)

    x[y]
    #    x y  z
    # 1: 1 a 10
    # 2: 1 a 11

A semi-join would return just x[1], i.e. the single row of x (x = 1, y = "a") whose key has a match in y.
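As a point of comparison, the semi-join semantics described above can be sketched with a plain `%in%` filter on the join column. This is only an illustration of the expected result, not one of the approaches proposed in this thread; the table names `X` and `Y` and the columns `id` and `val` are made up for the sketch (the question's tables reuse `x` and `y` as both table and column names, which makes this form awkward there).

```r
library(data.table)

# Illustrative tables; X, Y, id and val are names chosen for this sketch
X <- data.table(id = 1:2, val = c("a", "b"))
Y <- data.table(id = c(1L, 1L), z = 10:11)

# Keep a row of X (once, with X's columns only) whenever its id occurs in Y
semi <- X[id %in% Y$id]
semi
```

Even though `id = 1` occurs twice in `Y`, the matching row of `X` appears only once, and no column of `Y` is carried over.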

Asked by hadley, Sep 23 '13 at 21:09




2 Answers

More possibilities:

    w = unique(x[y, which = TRUE])  # the row numbers in x which have a match from y
    x[w]

If there are duplicate key values in x, then that needs allow.cartesian = TRUE:

    w = unique(x[y, which = TRUE, allow.cartesian = TRUE])
    x[w]
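A small worked case of the duplicate-key situation may help. This is a sketch under assumed data: the tables `X` and `Y` and the columns `id` and `val` are invented for illustration and are not from the thread.

```r
library(data.table)

# Hypothetical data: id = 1 appears twice in X and twice in Y
X <- data.table(id = c(1L, 1L, 2L), val = c("a", "b", "c"))
setkey(X, id)
Y <- data.table(id = c(1L, 1L), z = 10:11)

# Each row of Y matches rows 1 and 2 of X, so the raw row numbers repeat;
# unique() collapses them so every matching row of X is kept once
w <- unique(X[Y, which = TRUE, allow.cartesian = TRUE])
res <- X[w]
res
```

Both rows of `X` with `id = 1` are returned once each, and the row with `id = 2` is dropped.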

Or, the other way around:

    setkey(y, x)
    w = !is.na(y[x, which = TRUE, mult = "first"])
    x[w]

If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
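To make the y[x] direction concrete, here is a minimal sketch on made-up data (the names `X`, `Y`, `id` and `val` are assumptions for illustration only). Each row of `X` is looked up in `Y`; with `mult = "first"` the lookup returns one row number per row of `X`, or `NA` when there is no match, so the result is a logical mask over `X`.

```r
library(data.table)

# Hypothetical data: ids 1 and 3 occur in Y, id 2 does not
X <- data.table(id = 1:3, val = c("a", "b", "c"))
setkey(X, id)
Y <- data.table(id = c(1L, 1L, 3L), z = 10:12)
setkey(Y, id)

# One entry per row of X: row number of the first match in Y, or NA
first_match <- Y[X, which = TRUE, mult = "first"]
w <- !is.na(first_match)

res <- X[w]
res
```

Because the mask has exactly one entry per row of `X`, duplicates in `Y` cannot repeat rows of `X`, and no `unique()` step is needed in this direction.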

But the anti anti join appeals too :-)

Answered by Matt Dowle, Oct 06 '22 at 02:10



One solution I can think of is:

    tmp <- x[!y]
    x[!tmp]

In data.table, you can pass another data.table as the i expression (i.e., the first argument in the `[.data.table` call), and that will perform a join, e.g.:

    x <- data.table(x = 1:10, y = letters[1:10])
    setkey(x, x)
    y <- data.table(x = c(1,3,5,1), z = 1:4)

    > x[y]
       x y z
    1: 1 a 1
    2: 3 c 2
    3: 5 e 3
    4: 1 a 4

The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of the data.table documentation. So the first assignment evaluates to the subset of x that has no rows whose key (column x) is present in y:

    > x[!y]
        x y
    1:  2 b
    2:  4 d
    3:  6 f
    4:  7 g
    5:  8 h
    6:  9 i
    7: 10 j

It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.
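The two not-join steps can be traced end to end on the same kind of data. This is a sketch with renamed objects (`X`, `Y`, `id` and `val` are illustrative names, not the ones used above) so that table names and column names do not coincide.

```r
library(data.table)

# Same shape as the example above, with non-clashing illustrative names
X <- data.table(id = 1:10, val = letters[1:10])
setkey(X, id)
Y <- data.table(id = c(1L, 3L, 5L, 1L), z = 1:4)

# Step 1: not-join, i.e. the rows of X whose key is absent from Y
tmp <- X[!Y]

# Step 2: not-join against that complement, leaving the rows of X
# whose key is present in Y, each returned once
res <- X[!tmp]
res
```

The result contains only the rows with `id` 1, 3 and 5, and the duplicated `id = 1` in `Y` does not duplicate the corresponding row of `X`.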

The ! feature was added in data.table 1.8.4 with the following note in NEWS:

    o   A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
            DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
            DT[!"a"]                              # same result, now preferred.
            DT[!J(6),...]                         # !J == not-join
            DT[!2:3,...]                          # ! on all types of i
            DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
            DT[!J(6L,23L)]                        # same result, faster binary search
        '!' has been used rather than '-' :
            * to match the 'not-join'/'not-where' nomenclature
            * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
              compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
              base R) and after this new feature.
            * to leave DT[+J...] and DT[-J...] available for future use

For some reason, the following doesn't work: x[!(x[!y])]. Probably data.table is too smart about parsing the argument.

P.S. As Josh O'Brien pointed out in another answer, a one-liner would be x[!eval(x[!y])].

Answered by Victor K., Oct 06 '22 at 03:10
