What happens if I want to select all the rows in a data.table that do not contain a particular value in the key variable using binary search? By the way, what is the correct jargon for what I want to do? Is it "nojoin"? Is it "negative selection"?
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)
Lets do a positive selection for all rows where x=="a" but using binary search
DT["a"]
That's beautiful but I want the opposite of that. I want all the rows that are not "a" in other words where x!="a"
DT[x!="a"]
That is a vector scanning. The above line works but is uses vector scanning. I want to use binary. I was expecting the following to work, but alas...
DT[!"a"]
DT[-"a"]
The above two do not work and trying to play with nomatch got me nowhere.
The idiom is this:
DT[-DT["a", which=TRUE]]
x y v
1: b 1 4
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9
Inspiration from:
Update. New in v1.8.3 is not-join syntax. Farrel's first expectation (!
rather than -
) has been implemented :
DT[-DT["a",which=TRUE,nomatch=0],...] # old idiom
DT[!"a",...] # same result, now preferred.
See the NEWS item for more detailed info and example.
Andrie's answer is great, and is what I'd probably use. Interestingly, though, the following construct seems to be (just a bit) faster, especially as the size of the data.tables increase.
DT[J(x = unique(DT)[x!="a"][,x])]
##-------------------------------- Timings -----------------------------------##
library(data.table)
library(rbenchmark)
DT = data.table(x=rep(c("a","b","c"),each=45e5), y=c(1,3,6), v=1:9, key="x")
Josh <- function() DT[J(x = unique(DT)[x!="a"][,x])]
Andrie <- function() DT[-DT["a", which=TRUE]]
## Compare results
identical(Josh(), setkey(Andrie(), "x"))
# [1] TRUE
## Compare timings
benchmark(replications = 10, order="relative", Josh=Josh(), Andrie=Andrie())
test replications elapsed relative user.self sys.self user.child sys.child
1 Josh 10 17.50 1.000 14.78 3.6 NA NA
2 Andrie 10 18.75 1.071 16.52 3.2 NA NA
I'd be especially tempted to use this if DT[,x]
could be made to return a data.table rather than a vector. Then, the construct could be simplified a bit to DT[unique(DT[,x])[x!="a"]]
. Also, it would then work even when there are mulitiple columns in the key, which it currently does not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With