Suppose I have two data.table's:
A:
A B 1: 1 12 2: 2 13 3: 3 14 4: 4 15
B:
A B 1: 2 13 2: 3 14
and I have the following code:
merge_test = merge(dataA, dataB, by="A", all.data=TRUE)
I get:
A B.x B.y 1: 2 13 13 2: 3 14 14
However, I want all the rows in dataA in the final merged table. Is there a way to do this?
If you want to do a left join, you can use all. x = TRUE . If you want to do a full outer join, you can use all = TRUE .
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either: Delete the extra variables in data frameA or.
A left outer join or left join retains all of the rows of the left table company, regardless of whether there is a row that matches on the right table foods.
If you want to add the b
values of B
to A
, then it's best to join A
with B
and update A
by reference as follows:
A[B, on = 'a', bb := i.b]
which gives:
> A a b bb 1: 1 12 NA 2: 2 13 13 3: 3 14 14 4: 4 15 NA
This is a better approach than using B[A, on='a']
because the latter just prints the result to the console. When you want to get the results back into A
, you need to use A <- B[A, on='a']
which will give you the same result.
The reason why A[B, on = 'a', bb := i.b]
is better than A <- B[A, on = 'a']
is memory efficiency. With A[B, on = 'a', bb := i.b]
the location of A
in memory stays the same:
> address(A) [1] "0x102afa5d0" > A[B, on = 'a', bb := i.b] > address(A) [1] "0x102afa5d0"
While on the other hand with A <- B[A, on = 'a']
, a new object is created and saved in memory as A
and hence has another location in memory:
> address(A) [1] "0x102abae50" > A <- B[A, on = 'a'] > address(A) [1] "0x102aa7e30"
Using merge
(merge.data.table
) results in a similar change in memory location:
> address(A) [1] "0x111897e00" > A <- merge(A, B, by = 'a', all.x = TRUE) > address(A) [1] "0x1118ab000"
For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:
A[B, on = 'a', bb := i.b]
Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table
was designed.
Probably also worth mentioning is that the order of A
stays the same.
To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):
library(bench) bm <- mark(AA <- BB[AA, on = .(aa)], AA[BB, on = .(aa), cc := cc], iterations = 1)
which gives (only relevant measurements shown):
> bm[,c(1,3,5)] # A tibble: 2 x 3 expression median mem_alloc <bch:expr> <bch:tm> <bch:byt> 1 AA <- BB[AA, on = .(aa)] 4.98s 4.1GB 2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms 384.6MB
So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.
NOTE: Gains in speed and memory use might differ in different setups.
Used data:
# initial datasets A <- data.table(a = 1:4, b = 12:15) B <- data.table(a = 2:3, b = 13:14) # large datasets for the benchmark set.seed(2019) AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE)) BB <- data.table(aa = sample(AA$a, 2e5), cc = sample(2:8, 2e5, TRUE))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With