Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data.table - left outer join on multiple tables

Tags:

r

data.table

Suppose you have data like

fruits <- data.table(FruitID=c(1,2,3), Fruit=c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID=c(1,2,3,4,5), FruitID=c(1,1,1,2,3), Color=c("Red","Yellow","Green","Yellow","Red"))
tastes <- data.table(TasteID=c(1,2,3), FruitID=c(1,1,3), Taste=c("Sweeet", "Sour", "Sweet"))

setkey(fruits, "FruitID")
setkey(colors, "ColorID")
setkey(tastes, "TasteID")

fruits
   FruitID      Fruit
1:       1      Apple
2:       2     Banana
3:       3 Strawberry

colors
   ColorID FruitID  Color
1:       1       1    Red
2:       2       1 Yellow
3:       3       1  Green
4:       4       2 Yellow
5:       5       3    Red

tastes
   TasteID FruitID  Taste
1:       1       1 Sweeet
2:       2       1   Sour
3:       3       3  Sweet

I typically need to perform left-outer joins on data like this. For instance, "give me all fruits and their colors" requires me to write (and maybe there's a better way?)

setkey(colors, "FruitID")
result <- colors[fruits, allow.cartesian=TRUE]
setkey(colors, "ColorID")

Three lines of code for such a simple and frequent task seemed excessive, so I wrote a method myLeftJoin

myLeftJoin <- function(tbl1, tbl2){
  # Performs a left join using the key in tbl1 (i.e. keeps all rows from tbl1 and only matching rows from tbl2)

  oldkey <- key(tbl2)
  setkeyv(tbl2, key(tbl1))
  result <- tbl2[tbl1, allow.cartesian=TRUE]
  setkeyv(tbl2, oldkey)
  return(result)
}

which I can use like

myLeftJoin(fruits, colors)
   ColorID FruitID  Color      Fruit
1:       1       1    Red      Apple
2:       2       1 Yellow      Apple
3:       3       1  Green      Apple
4:       4       2 Yellow     Banana
5:       5       3    Red Strawberry

How can I extend this method so that I can pass any number of tables to it and get the chained left outer join of all of them? Something like myLeftJoin(tbl1, ...)

For instance, I'd like the result of myleftJoin(fruits, colors, tastes) to be equivalent to

setkey(colors, "FruitID")
setkey(tastes, "FruitID")
result <- tastes[colors[fruits, allow.cartesian=TRUE], allow.cartesian=TRUE]
setkey(tastes, "TasteID")
setkey(colors, "ColorID")

result
   TasteID FruitID  Taste ColorID  Color      Fruit
1:       1       1 Sweeet       1    Red      Apple
2:       2       1   Sour       1    Red      Apple
3:       1       1 Sweeet       2 Yellow      Apple
4:       2       1   Sour       2 Yellow      Apple
5:       1       1 Sweeet       3  Green      Apple
6:       2       1   Sour       3  Green      Apple
7:      NA       2     NA       4 Yellow     Banana
8:       3       3  Sweet       5    Red Strawberry

Perhaps there's an elegant solution using methods in the data.table package that I missed? Thanks

(EDIT: Fixed a mistake in my data)

like image 670
Ben Avatar asked Jul 17 '15 16:07

Ben


People also ask

What does left outer join on 2 tables do?

LEFT (OUTER) JOIN : Returns all records from the left table, and the matched records from the right table. RIGHT (OUTER) JOIN : Returns all records from the right table, and the matched records from the left table. FULL (OUTER) JOIN : Returns all records when there is a match in either left or right table.

Is outer joined to multiple tables?

Using OUTER-JOINS, it is possible to join only two tables at a time. In order to join three tables, it is necessary to first select two tables to join and then join the third table to the result of the first join.

Can you have multiple left outer JOINs?

Yes, indeed! You can use multiple LEFT JOINs in one query if needed for your analysis.

Can we apply join on more than 2 tables?

The join operation is used to combine related rows from two tables into a result set. Join is a binary operation. More than two tables can be combined using multiple join operations.


1 Answers

I just committed a new feature in data.table, v1.9.5, with which we can join without setting keys (that is, specify the columns to join by directly, without having to use setkey() first):

With that, this is simply:

require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
#    FruitID      Fruit TasteID  Taste ColorID  Color
# 1:       1      Apple       1 Sweeet       1    Red
# 2:       1      Apple       2   Sour       1    Red
# 3:       1      Apple       1 Sweeet       2 Yellow
# 4:       1      Apple       2   Sour       2 Yellow
# 5:       1      Apple       1 Sweeet       3  Green
# 6:       1      Apple       2   Sour       3  Green
# 7:       2         NA      NA     NA       4 Yellow
# 8:       3 Strawberry       3  Sweet       5    Red
like image 133
Arun Avatar answered Nov 10 '22 00:11

Arun