Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Left join using data.table

Suppose I have two data.table's:

A:

  A  B 1: 1 12 2: 2 13 3: 3 14 4: 4 15 

B:

   A  B 1: 2 13 2: 3 14 

and I have the following code:

merge_test = merge(dataA, dataB, by="A", all.data=TRUE) 

I get:

   A B.x B.y 1: 2  13  13 2: 3  14  14 

However, I want all the rows in dataA in the final merged table. Is there a way to do this?

like image 808
lord12 Avatar asked Jan 04 '16 19:01

lord12


People also ask

How do you do a left join in a data table?

If you want to do a left join, you can use all. x = TRUE . If you want to do a full outer join, you can use all = TRUE .

How do I join two data tables in R?

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either: Delete the extra variables in data frameA or.

Does LEFT join add rows?

A left outer join or left join retains all of the rows of the left table company, regardless of whether there is a row that matches on the right table foods.


1 Answers

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b] 

which gives:

> A    a  b bb 1: 1 12 NA 2: 2 13 13 3: 3 14 14 4: 4 15 NA 

This is a better approach than using B[A, on='a'] because the latter just prints the result to the console. When you want to get the results back into A, you need to use A <- B[A, on='a'] which will give you the same result.

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A) [1] "0x102afa5d0" > A[B, on = 'a', bb := i.b] > address(A) [1] "0x102afa5d0" 

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A) [1] "0x102abae50" > A <- B[A, on = 'a'] > address(A) [1] "0x102aa7e30" 

Using merge (merge.data.table) results in a similar change in memory location:

> address(A) [1] "0x111897e00" > A <- merge(A, B, by = 'a', all.x = TRUE) > address(A) [1] "0x1118ab000" 

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b]  

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning is that the order of A stays the same.


To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):

library(bench) bm <- mark(AA <- BB[AA, on = .(aa)],            AA[BB, on = .(aa), cc := cc],            iterations = 1) 

which gives (only relevant measurements shown):

> bm[,c(1,3,5)] # A tibble: 2 x 3   expression                         median mem_alloc   <bch:expr>                       <bch:tm> <bch:byt> 1 AA <- BB[AA, on = .(aa)]            4.98s     4.1GB 2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB 

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

NOTE: Gains in speed and memory use might differ in different setups.


Used data:

# initial datasets A <- data.table(a = 1:4, b = 12:15) B <- data.table(a = 2:3, b = 13:14)  # large datasets for the benchmark set.seed(2019) AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE)) BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE)) 
like image 66
Jaap Avatar answered Sep 22 '22 09:09

Jaap