Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Translating SQL joins on foreign keys to R data.table syntax

Tags:

The data.table package provides many of the same table handling methods as SQL. If a table has a key, that key consists of one or more columns. But a table can't have more than one key, because it can't be sorted in two different ways at the same time.

In this example, X and Y are data.tables with a single key column "id"; Y also has a non-key column "x_id".

   X <- data.table(id = 1:5, a=4:8,key="id")
   Y <- data.table(id = c(1,1, 3,5,7), x_id=c(1,4:1), key="id")

The following syntax would join the tables on their keys:

  X[Y]

How can I translate the following SQL syntax to data.table code?

  select * from X join Y on X.id = Y.x_id; 

The closest that I have gotten is:

Y[X,list(id, x_id),by = x_id,nomatch=0]

However, this does not do the same inner join as the SQL statement.


Here is a more clear example in which the foreign key is y_id, and we want the join to look up values of Y2 where X2$y_id = Y2$id.

    X2 <- data.table(id = 1:5, y_id = c(1,1,2,2,2), key="id")
    Y2 <- data.table(id = 1:5, b = letters[1:5], key="id")

I would like to produce the table:

   id  y_id  b
    1     1 "a"
    2     1 "a"
    3     2 "b"
    4     2 "b"
    5     2 "b"

similar to what is done by the following kludge:

> merge(data.frame(X2), data.frame(Y2), by.x = "y_id", by.y = "id")
  y_id id b
1    1  1 a
2    1  2 a
3    2  3 b
4    2  4 b
5    2  5 b

However, when I do this:

    X2[Y2, 1:2,by = y_id]

I do not get the desired result:

    y_id V1
[1,]    1  1
[2,]    1  2
[3,]    2  1
[4,]    2  2
like image 864
David LeBauer Avatar asked Mar 28 '12 19:03

David LeBauer


People also ask

How are foreign keys used in table joins?

A FOREIGN KEY enforces data integrity, making sure the data confirms to some rules when it is added to the DB. A JOIN is used when you extract/query data from the DB by giving rules how to select the data. JOIN s work if there are FK or not. FK's work if you extract data with or without JOIN s.

Is foreign key used in joins?

A foreign key is a column or group of columns in one table that contains values that match the primary key in another table. Foreign keys are used to join tables. The following figure shows the primary and foreign keys of the customer and orders tables from the demonstration database.

How do you create a join table in SQL?

Inner Join clause in SQL Server creates a new table (not physical) by combining rows that have matching values in two or more tables. This join is based on a logical relationship (or a common field) between the tables and is used to retrieve data that appears in both tables.

Can foreign key refer to same table?

FOREIGN KEY constraints can reference another column in the same table, and is referred to as a self-reference. A FOREIGN KEY constraint specified at the column level can list only one reference column. This column must have the same data type as the column on which the constraint is defined.


1 Answers

Good question. Note the following (admittedly buried) in ?data.table :

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.

X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
     id y_id
[1,] 11   14
[2,] 12   14
[3,] 13   11
[4,] 14   12
[5,] 15   12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
     id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)]  # binary search for each item of (unsorted and unkeyed) i
     id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b

or,

Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
     id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d

identical(Y2[J(X2$y_id)], Y2[X2$y_id])
[1] FALSE
like image 180
Matt Dowle Avatar answered Oct 19 '22 11:10

Matt Dowle