Translating SQL joins on foreign keys to R data.table syntax

The data.table package provides many of the same table handling methods as SQL. If a table has a key, that key consists of one or more columns. But a table can't have more than one key, because it can't be sorted in two different ways at the same time.

In this example, X and Y are data.tables with a single key column "id"; Y also has a non-key column "x_id".

    X <- data.table(id = 1:5, a = 4:8, key = "id")
    Y <- data.table(id = c(1, 1, 3, 5, 7), x_id = c(1, 4:1), key = "id")

The following syntax would join the tables on their keys:

  X[Y]

How can I translate the following SQL syntax to data.table code?

  select * from X join Y on X.id = Y.x_id;

The closest that I have gotten is:

    Y[X, list(id, x_id), by = x_id, nomatch = 0]

However, this does not do the same inner join as the SQL statement.

Here is a clearer example, in which the foreign key is y_id and we want the join to look up the values of Y2 where X2$y_id = Y2$id.

    X2 <- data.table(id = 1:5, y_id = c(1,1,2,2,2), key="id")
    Y2 <- data.table(id = 1:5, b = letters[1:5], key="id")

I would like to produce the table:

   id  y_id  b
    1     1 "a"
    2     1 "a"
    3     2 "b"
    4     2 "b"
    5     2 "b"

similar to what is done by the following kludge:

> merge(data.frame(X2), data.frame(Y2), by.x = "y_id", by.y = "id")
  y_id id b
1    1  1 a
2    1  2 a
3    2  3 b
4    2  4 b
5    2  5 b

However, when I do this:

    X2[Y2, 1:2,by = y_id]

I do not get the desired result:

    y_id V1
[1,]    1  1
[2,]    1  2
[3,]    2  1
[4,]    2  2

asked Mar 28 '12 19:03

David LeBauer

1 Answer

Good question. Note the following (admittedly buried) in ?data.table:

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.
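To make the quoted rule concrete, here is a minimal sketch of a keyed `x` joined to an unkeyed `i` that has one extra column. The table names `lookup` and `probe` and the column `src` are hypothetical, chosen only for illustration. The first column of `probe` is matched against the key of `lookup`, and the extra column is carried into the result.

```r
library(data.table)

# Hypothetical keyed table playing the role of x
lookup <- data.table(id = 11:15, b = letters[1:5], key = "id")

# Hypothetical unkeyed table playing the role of i:
# the first column is matched to lookup's key, src is not part of the join
probe <- data.table(y_id = c(14L, 14L, 11L), src = c("p", "q", "r"))

# One binary search per row of probe; rows come back in probe's order
res <- lookup[probe]

print(res)  # columns id, b, src; b is "d", "d", "a"
```

Because `probe` is unkeyed, its rows are neither sorted nor deduplicated before the lookup, so the result keeps their original order.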

    > X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
    > X2
         id y_id
    [1,] 11   14
    [2,] 12   14
    [3,] 13   11
    [4,] 14   12
    [5,] 15   12
    > Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
    > Y2
         id b
    [1,] 11 a
    [2,] 12 b
    [3,] 13 c
    [4,] 14 d
    [5,] 15 e
    > Y2[J(X2$y_id)]  # binary search for each item of (unsorted and unkeyed) i
         id b
    [1,] 14 d
    [2,] 14 d
    [3,] 11 a
    [4,] 12 b
    [5,] 12 b

or,

    > Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
         id b
    [1,] 11 a
    [2,] 12 b
    [3,] 12 b
    [4,] 14 d
    [5,] 14 d

    > # A bare numeric vector in i selects rows by position; it does not join, hence J()
    > identical(Y2[J(X2$y_id)], Y2[X2$y_id])
    [1] FALSE
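For readers on a more recent data.table, the two joins from the question can also be sketched with the `on=` argument, which names the join columns directly instead of relying on keys. This is not the syntax used in the answer above. It assumes a data.table version that supports `on=` and `nomatch = NULL`, and it uses integer literals (an assumption of this sketch) so that both join columns have the same type.

```r
library(data.table)

# Tables from the question, with integer literals for the join columns
X  <- data.table(id = 1:5, a = 4:8, key = "id")
Y  <- data.table(id = c(1L, 1L, 3L, 5L, 7L), x_id = c(1L, 4:1), key = "id")
X2 <- data.table(id = 1:5, y_id = c(1L, 1L, 2L, 2L, 2L), key = "id")
Y2 <- data.table(id = 1:5, b = letters[1:5], key = "id")

# select * from X join Y on X.id = Y.x_id;
# nomatch = NULL drops rows of Y without a match in X (inner join)
XY <- X[Y, on = .(id = x_id), nomatch = NULL]

# Look up b in Y2 where X2$y_id = Y2$id; columns come out as id, y_id, b
res2 <- X2[Y2, on = .(y_id = id), nomatch = NULL]

print(XY)
print(res2)
```

In `XY`, the column named `id` holds the matched values of `Y$x_id`, since the join column takes its name from `X`. In `res2`, rows of `Y2` whose `id` never appears in `X2$y_id` are dropped, which leaves the five rows asked for in the question.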

answered Oct 19 '22 11:10

Matt Dowle
