Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I do a negative / nomatch / inverse search in data.table?

Tags:

What happens if I want to select all the rows in a data.table that do not contain a particular value in the key variable using binary search? By the way, what is the correct jargon for what I want to do? Is it "nojoin"? Is it "negative selection"?

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Lets do a positive selection for all rows where x=="a" but using binary search

DT["a"]

That's beautiful but I want the opposite of that. I want all the rows that are not "a" in other words where x!="a"

DT[x!="a"]

That is a vector scanning. The above line works but is uses vector scanning. I want to use binary. I was expecting the following to work, but alas...

DT[!"a"]
DT[-"a"]

The above two do not work and trying to play with nomatch got me nowhere.

like image 514
Farrel Avatar asked Sep 07 '12 13:09

Farrel


2 Answers

The idiom is this:

DT[-DT["a", which=TRUE]]

   x y v
1: b 1 4
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9

Inspiration from:

  • The mailing list posting Return Select/Join that does NOT match?
  • The previous question non-joins with data.tables
  • Matthew Dowle's answer to Porting set operations from R's data frames to data tables: How to identify duplicated rows?

Update. New in v1.8.3 is not-join syntax. Farrel's first expectation (! rather than -) has been implemented :

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...]                            # same result, now preferred.

See the NEWS item for more detailed info and example.

like image 131
Andrie Avatar answered Sep 22 '22 10:09

Andrie


Andrie's answer is great, and is what I'd probably use. Interestingly, though, the following construct seems to be (just a bit) faster, especially as the size of the data.tables increase.

DT[J(x = unique(DT)[x!="a"][,x])]

##-------------------------------- Timings -----------------------------------##

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b","c"),each=45e5), y=c(1,3,6), v=1:9, key="x")
Josh <- function() DT[J(x = unique(DT)[x!="a"][,x])]
Andrie <- function() DT[-DT["a", which=TRUE]]

## Compare results
identical(Josh(), setkey(Andrie(), "x"))  
# [1] TRUE

## Compare timings
benchmark(replications = 10, order="relative", Josh=Josh(), Andrie=Andrie())
    test replications elapsed relative user.self sys.self user.child sys.child
1   Josh           10   17.50    1.000     14.78      3.6         NA        NA
2 Andrie           10   18.75    1.071     16.52      3.2         NA        NA

I'd be especially tempted to use this if DT[,x] could be made to return a data.table rather than a vector. Then, the construct could be simplified a bit to DT[unique(DT[,x])[x!="a"]]. Also, it would then work even when there are mulitiple columns in the key, which it currently does not.

like image 30
Josh O'Brien Avatar answered Sep 20 '22 10:09

Josh O'Brien