I have a question about the data.table idiom for "non-joins", inspired by Iterator's question. Here is an example:
library(data.table)
dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE))
setkey(dt1, A1)
setkey(dt2, A2)
The data.tables look like this:
> dt1                 > dt2
      A1 B1                 A2 B2
 [1,]  a  1           [1,]  a  2
 [2,]  b  4           [2,]  b  5
 [3,]  c  2           [3,]  c  2
 [4,]  d  5           [4,]  d  1
 [5,]  e  1           [5,]  e  1
 [6,]  f  2           [6,]  k  5
 [7,]  g  3           [7,]  l  2
 [8,]  h  3           [8,]  m  4
 [9,]  i  2           [9,]  n  1
[10,]  j  4          [10,]  o  1
To find which rows in dt2 have the same key in dt1, set the which option to TRUE:
> dt1[dt2, which=TRUE]
[1] 1 2 3 4 5 NA NA NA NA NA
Matthew suggested in this answer that the "non-join" idiom dt1[-dt1[dt2, which=TRUE]] can be used to subset dt1 to those rows whose keys do not appear in dt2. On my machine with data.table v1.7.1 I get an error:
Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts
Instead, with the option nomatch=0, the "non-join" works:
> dt1[-dt1[dt2, which=TRUE, nomatch=0]]
     A1 B1
[1,]  f  2
[2,]  g  3
[3,]  h  3
[4,]  i  2
[5,]  j  4
Is this intended behavior?
New in v1.8.3:
A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384.
DT[-DT["a", which=TRUE, nomatch=0]] # old not-join idiom, still works
DT[!"a"] # same result, now preferred.
DT[!J(6),...] # !J == not-join
DT[!2:3,...] # ! on all types of i
DT[colA!=6L | colB!=23L,...] # multiple vector scanning approach
DT[!J(6L,23L)] # same result, faster binary search
'!' has been used rather than '-' :
* to match the 'not-join' and 'not-where' nomenclature
* with '-', DT[-0] would return DT rather than DT[0] and not be backwards
compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
base R) and after this new feature.
* to leave DT[+...] and DT[-...] available for future use
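Applied to the data in the question (a quick sketch, assuming data.table 1.8.3 or later; the sampled B1/B2 values will differ from run to run), the new prefix replaces the old idiom directly:
library(data.table)
dt1 <- data.table(A1=letters[1:10],          B1=sample(1:5, 10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5, 10, replace=TRUE))
setkey(dt1, A1)
setkey(dt2, A2)

dt1[!dt2]                               # not-join: rows of dt1 whose key is absent from dt2 (f, g, h, i, j)
dt1[-dt1[dt2, which=TRUE, nomatch=0]]   # older idiom from the question, same result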
As far as I know, this is part of base R.
# This works
(1:4)[c(-2,-3)]
# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts
The error message itself indicates that this is intended behavior.
Here's my best guess as to why that is the intended behavior:
From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information and are loath to drop it without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)
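For instance, base R's summary functions propagate NA by default and only drop it when told to explicitly (a small illustration, not from the original post):
sum(c(1, 2, NA))                # NA: the missing value is propagated by default
sum(c(1, 2, NA), na.rm=TRUE)    # 3: dropping it requires an explicit instruction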
In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:
# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2, 3, NA)]
# [1]  2  3 NA

# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2, -3, NA)]
# Error: only 0's may be mixed with negative subscripts