How do I do a negative / nomatch / inverse search in data.table?

Question

What happens if I want to select all the rows in a data.table that do not contain a particular value in the key variable using binary search? By the way, what is the correct jargon for what I want to do? Is it "nojoin"? Is it "negative selection"?

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Lets do a positive selection for all rows where x=="a" but using binary search

DT["a"]

That's beautiful but I want the opposite of that. I want all the rows that are not "a" in other words where x!="a"

DT[x!="a"]

That is a vector scanning. The above line works but is uses vector scanning. I want to use binary. I was expecting the following to work, but alas...

DT[!"a"]
DT[-"a"]

The above two do not work and trying to play with nomatch got me nowhere.

Andrie · Accepted Answer

The idiom is this:

DT[-DT["a", which=TRUE]]

   x y v
1: b 1 4
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9

Inspiration from:

The mailing list posting Return Select/Join that does NOT match?
The previous question non-joins with data.tables
Matthew Dowle's answer to Porting set operations from R's data frames to data tables: How to identify duplicated rows?

Update. New in v1.8.3 is not-join syntax. Farrel's first expectation (! rather than -) has been implemented :

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...]                            # same result, now preferred.

See the NEWS item for more detailed info and example.

Josh O'Brien · Answer

Andrie's answer is great, and is what I'd probably use. Interestingly, though, the following construct seems to be (just a bit) faster, especially as the size of the data.tables increase.

DT[J(x = unique(DT)[x!="a"][,x])]

##-------------------------------- Timings -----------------------------------##

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b","c"),each=45e5), y=c(1,3,6), v=1:9, key="x")
Josh <- function() DT[J(x = unique(DT)[x!="a"][,x])]
Andrie <- function() DT[-DT["a", which=TRUE]]

## Compare results
identical(Josh(), setkey(Andrie(), "x"))  
# [1] TRUE

## Compare timings
benchmark(replications = 10, order="relative", Josh=Josh(), Andrie=Andrie())
    test replications elapsed relative user.self sys.self user.child sys.child
1   Josh           10   17.50    1.000     14.78      3.6         NA        NA
2 Andrie           10   18.75    1.071     16.52      3.2         NA        NA

I'd be especially tempted to use this if DT[,x] could be made to return a data.table rather than a vector. Then, the construct could be simplified a bit to DT[unique(DT[,x])[x!="a"]]. Also, it would then work even when there are mulitiple columns in the key, which it currently does not.

How do I do a negative / nomatch / inverse search in data.table?

Tags:

Farrel

2 Answers

Andrie

Josh O'Brien

Recent Activity

Donate For Us

How do I do a negative / nomatch / inverse search in data.table?

Tags:

Farrel

2 Answers

Andrie

Josh O'Brien

Related questions

Recent Activity

Donate For Us