Perform a semi-join with data.table

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:

library(data.table)

x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)

x[y]
#    x y  z
# 1: 1 a 10
# 2: 1 a 11

A semi-join would return just x[1]

693

asked Sep 23 '13 21:09

hadley

2 Answers

More possibilities :

w = unique(x[y,which=TRUE])  # the row numbers in x which have a match from y
x[w]

If there are duplicate key values in x, then that needs :

w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
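To illustrate the duplicate-key case, here is a minimal sketch with made-up data (the tables below are hypothetical, not from the question). The key value 1 appears three times in both tables, so the join produces 9 matches, which is more than nrow(x) + nrow(y); that is the situation allow.cartesian=TRUE is meant to permit.

```r
library(data.table)

# Hypothetical data: key value 1 is duplicated in both tables.
x <- data.table(x = c(1L, 1L, 1L, 2L), y = c("a", "b", "c", "d"), key = "x")
y <- data.table(x = c(1L, 1L, 1L), z = 1:3)

# Each row of y matches three rows of x, so the row numbers repeat;
# unique() collapses them back to one entry per matching row of x.
w <- unique(x[y, which = TRUE, allow.cartesian = TRUE])
semi <- x[w]
```

Every matching row of x is kept once, including the rows that share a key value.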

Or, the other way around :

setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]

If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
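As a rough check that the two directions agree, the sketch below runs both on a small made-up example (hypothetical data). It assumes y may contain a key that is missing from x, so the x[y] direction uses nomatch = 0L to drop that row instead of getting NA back as a row number; that argument is an addition to the code above.

```r
library(data.table)

# Hypothetical data: key 9 in y has no match in x.
x <- data.table(x = 1:4, y = c("a", "b", "c", "d"), key = "x")
y <- data.table(x = c(1L, 1L, 3L, 9L), z = 10:13)

# x[y] direction: row numbers of x that are matched by some row of y.
w1 <- unique(x[y, which = TRUE, nomatch = 0L])
semi1 <- x[w1]

# y[x] direction: for each row of x, is there at least one match in y?
setkey(y, x)
w2 <- !is.na(y[x, which = TRUE, mult = "first"])
semi2 <- x[w2]
```

Both semi1 and semi2 contain the rows of x with keys 1 and 3, each once, and only the columns of x.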

But the anti anti join appeals too :-)

116

answered Oct 06 '22 02:10

Matt Dowle

One solution I can think of is:

tmp <- x[!y]
x[!tmp]

In data.table, you can pass another data table as the i expression (i.e., the first argument in the [.data.table call), and that will perform a join, e.g.:

x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)

> x[y]
   x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4

The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of data.table documentation. So the first assignment evaluates to a subset of x that doesn't have any rows where the key (column x) is present in y:

> x[!y]
    x y
1:  2 b
2:  4 d
3:  6 f
4:  7 g
5:  8 h
6:  9 i
7: 10 j

It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.
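Putting the two steps together on the same example tables, a minimal sketch of the whole "anti anti join" might look like this (the name semi is only for illustration):

```r
library(data.table)

x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1, 3, 5, 1), z = 1:4)

# Step 1: not-join, the rows of x whose key does not appear in y.
tmp <- x[!y]

# Step 2: not-join against that remainder, leaving the rows of x
# whose key does appear in y.
semi <- x[!tmp]
```

Even though the key 1 appears twice in y, the row with x = 1 shows up only once in semi, and semi has only the columns of x.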

The ! feature was added in data.table 1.8.4 with the following note in NEWS:

o   A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
        DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
        DT[!"a"]                              # same result, now preferred.
        DT[!J(6),...]                         # !J == not-join
        DT[!2:3,...]                          # ! on all types of i
        DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
        DT[!J(6L,23L)]                        # same result, faster binary search
    '!' has been used rather than '-' :
        * to match the 'not-join'/'not-where' nomenclature
        * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
          compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
          base R) and after this new feature.
        * to leave DT[+J...] and DT[-J...] available for future use

For some reason, the following doesn't work: x[!(x[!y])]. Probably data.table is too smart about parsing the argument.

P.S. As Josh O'Brien pointed out in another answer, a one-liner would be x[!eval(x[!y])].

answered Oct 06 '22 03:10

Victor K.

Tags: r, data.table, semi-join
