I have a question about the data.table idiom for "non-joins", inspired by Iterator's question. Here is an example:
library(data.table)
dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE))
setkey(dt1, A1)
setkey(dt2, A2)
The data.tables look like this:
> dt1                 > dt2
      A1 B1                 A2 B2
 [1,]  a  1           [1,]  a  2
 [2,]  b  4           [2,]  b  5
 [3,]  c  2           [3,]  c  2
 [4,]  d  5           [4,]  d  1
 [5,]  e  1           [5,]  e  1
 [6,]  f  2           [6,]  k  5
 [7,]  g  3           [7,]  l  2
 [8,]  h  3           [8,]  m  4
 [9,]  i  2           [9,]  n  1
[10,]  j  4          [10,]  o  1
To find which rows in dt2 have the same key in dt1, set the which option to TRUE:
> dt1[dt2, which=TRUE]
[1] 1 2 3 4 5 NA NA NA NA NA
Matthew suggested in this answer that the "non-join" idiom dt1[-dt1[dt2, which=TRUE]] can be used to subset dt1 to those rows whose keys do not appear in dt2. On my machine with data.table v1.7.1 I get an error:
Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts
Instead, with the option nomatch=0, the "non-join" works:
> dt1[-dt1[dt2, which=TRUE, nomatch=0]]
     A1 B1
[1,]  f  2
[2,]  g  3
[3,]  h  3
[4,]  i  2
[5,]  j  4
Is this intended behavior?
New in v1.8.3:
A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384.
DT[-DT["a", which=TRUE, nomatch=0]] # old not-join idiom, still works
DT[!"a"] # same result, now preferred.
DT[!J(6),...] # !J == not-join
DT[!2:3,...] # ! on all types of i
DT[colA!=6L | colB!=23L,...] # multiple vector scanning approach
DT[!J(6L,23L)] # same result, faster binary search
'!' has been used rather than '-' :
* to match the 'not-join' and 'not-where' nomenclature
* with '-', DT[-0] would return DT rather than DT[0] and not be backwards
compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
base R) and after this new feature.
* to leave DT[+...] and DT[-...] available for future use
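Applied to the data in the question (a quick sketch, assuming data.table 1.8.3 or later; the sampled B1/B2 values will differ from run to run), the new prefix replaces the old idiom directly:
library(data.table)
dt1 <- data.table(A1=letters[1:10],          B1=sample(1:5, 10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5, 10, replace=TRUE))
setkey(dt1, A1)
setkey(dt2, A2)

dt1[!dt2]                               # not-join: rows of dt1 whose key is absent from dt2 (f, g, h, i, j)
dt1[-dt1[dt2, which=TRUE, nomatch=0]]   # older idiom from the question, same result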
As far as I know, this is part of base R.
# This works
(1:4)[c(-2,-3)]
# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts
The error message itself indicates that this is intended behavior.
Here's my best guess as to why that is the intended behavior:
From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information and are loath to drop it without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)
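For instance, base R's summary functions propagate NA by default and only drop it when told to explicitly (a small illustration, not from the original post):
sum(c(1, 2, NA))                # NA: the missing value is propagated by default
sum(c(1, 2, NA), na.rm=TRUE)    # 3: dropping it requires an explicit instruction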
In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:
# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2, 3, NA)]
# [1]  2  3 NA

# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2, -3, NA)]
# Error: only 0's may be mixed with negative subscripts