
non-joins with data.tables

Tags: r, data.table

I have a question about the data.table idiom for "non-joins", inspired by Iterator's question. Here is an example:

library(data.table)

dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE))

setkey(dt1, A1)
setkey(dt2, A2)

The data.tables look like this:

> dt1               > dt2
      A1 B1               A2 B2
 [1,]  a  1          [1,]  a  2
 [2,]  b  4          [2,]  b  5
 [3,]  c  2          [3,]  c  2
 [4,]  d  5          [4,]  d  1
 [5,]  e  1          [5,]  e  1
 [6,]  f  2          [6,]  k  5
 [7,]  g  3          [7,]  l  2
 [8,]  h  3          [8,]  m  4
 [9,]  i  2          [9,]  n  1
[10,]  j  4         [10,]  o  1

To find, for each row of dt2, the row of dt1 that has the same key, set the which option to TRUE:

> dt1[dt2, which=TRUE]
[1]  1  2  3  4  5 NA NA NA NA NA
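When the keys are unique, the result above lines up with what base R's match() returns for the two key columns. A minimal sketch, using plain character vectors (keys1 and keys2 are illustrative names, not part of the original example) in place of the keyed tables:

```r
# Key columns of dt1 and dt2 from the example, as plain vectors
keys1 <- letters[1:10]               # a..j
keys2 <- letters[c(1:5, 11:15)]      # a..e, k..o

# For each key in keys2, the position of the same key in keys1,
# or NA when there is no match
idx <- match(keys2, keys1)
idx
# [1]  1  2  3  4  5 NA NA NA NA NA
```

The NA entries are the rows of dt2 with no counterpart in dt1, and they are what later gets in the way of negative indexing.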

Matthew suggested in this answer a "non-join" idiom

dt1[-dt1[dt2, which=TRUE]]

to subset dt1 to the rows whose keys do not appear in dt2. On my machine, with data.table v1.7.1, I get an error:

Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts

With the option nomatch=0, however, the "non-join" works:

> dt1[-dt1[dt2, which=TRUE, nomatch=0]]
     A1 B1
[1,]  f  2
[2,]  g  3
[3,]  h  3
[4,]  i  2
[5,]  j  4

Is this intended behavior?

Asked by Ryogi, Oct 27 '11 at 18:10




2 Answers

New in v1.8.3:

A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384.
  DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
  DT[!"a"]                              # same result, now preferred.
  DT[!J(6),...]                         # !J == not-join
  DT[!2:3,...]                          # ! on all types of i
  DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach
  DT[!J(6L,23L)]                        # same result, faster binary search
'!' has been used rather than '-' :
  * to match the 'not-join' and 'not-where' nomenclature
  * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
    compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
    base R) and after this new feature.
  * to leave DT[+...] and DT[-...] available for future use
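
Applied to tables shaped like those in the question, the two idioms can be compared side by side. This is only a sketch under a few assumptions: the B1 and B2 values are fixed here instead of sampled so the result is reproducible, and the join columns are named through the on= argument, which is available in later data.table releases (1.9.6 and up) than the one this note describes.

```r
library(data.table)

# Same key columns as in the question; B1/B2 are fixed, illustrative values
dt1 <- data.table(A1 = letters[1:10], B1 = 1:10)
dt2 <- data.table(A2 = letters[c(1:5, 11:15)], B2 = 10:1)

# Not-join: rows of dt1 whose A1 has no match in dt2$A2
res_new <- dt1[!dt2, on = c(A1 = "A2")]

# Old idiom for comparison: nomatch = 0 drops the non-matching rows,
# so the index vector contains no NAs before it is negated
res_old <- dt1[-dt1[dt2, on = c(A1 = "A2"), which = TRUE, nomatch = 0]]

res_new$A1
# [1] "f" "g" "h" "i" "j"
```

Both forms should return the rows f through j of dt1; the "!" form avoids building and negating an index vector by hand.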
Answered by Matt Dowle, Oct 26 '22 at 11:10


As far as I know, this behavior comes from base R.

# This works
(1:4)[c(-2,-3)]

# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] : 
#   only 0's may be mixed with negative subscripts

The wording of the error message suggests that this is intended behavior.

Here's my best guess as to why that is the intended behavior:

From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information, and are loath to drop that without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)

In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:

# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]

# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
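
If the index vector may contain NA's, there are at least two ways in base R to give that explicit instruction before excluding elements. The sketch below is illustrative only (x, idx, and the result names are made up for the example):

```r
x   <- 1:4
idx <- c(2L, 3L, NA)

# Option 1: drop the NA's explicitly, then negate
drop <- idx[!is.na(idx)]
res1 <- x[-drop]
res1
# [1] 1 4

# Option 2: select the complement by position; setdiff() ignores the NA
# because NA is not one of the positions in seq_along(x)
res2 <- x[setdiff(seq_along(x), idx)]
res2
# [1] 1 4

# Caveat for option 1: an empty index negates to an empty index,
# so nothing is selected instead of everything
x[-integer(0)]
# integer(0)
```

Option 2 may be the safer choice when the index can end up empty, since setdiff(seq_along(x), integer(0)) still returns every position.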
Answered by Josh O'Brien, Oct 26 '22 at 11:10