
Left join using data.table

Tags:

merge

join

r

data.table

Suppose I have two data.tables:

A:

   A  B
1: 1 12
2: 2 13
3: 3 14
4: 4 15

B:

   A  B
1: 2 13
2: 3 14

and I have the following code:

merge_test = merge(dataA, dataB, by="A", all.data=TRUE)

I get:

   A B.x B.y
1: 2  13  13
2: 3  14  14

However, I want all the rows in dataA in the final merged table. Is there a way to do this?


asked Jan 04 '16 19:01

lord12

1 Answer

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better approach than using B[A, on='a'] because the latter just prints the result to the console. When you want to get the results back into A, you need to use A <- B[A, on='a'], which gives the same joined values (the columns are then named a, b and i.b).
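As a minimal sketch of that difference, using the small A and B tables from the "Used data" section (the name res is only illustrative), B[A, on = 'a'] builds a new table and leaves A untouched until the result is assigned:

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# Join that keeps every row of A; rows of A without a match in B get NA in B's column.
res <- B[A, on = "a"]

names(res)  # "a" "b" "i.b": b comes from B, i.b is A's original b
res$b       # NA 13 14 NA

# A is unchanged: the join result only exists in 'res'.
ncol(A)     # 2
```

Assigning res back to A would replace A with this new three-column table.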

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"
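If a new table is acceptable, the merge call above is the left join the question asks for. The sketch below uses the small example tables; the name left is only illustrative. The question's all.data = TRUE does not appear to be a documented merge argument, which would explain why only the matching rows came back; all.x = TRUE is the argument that keeps every row of the first table.

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# all.x = TRUE keeps all rows of the first table (left join).
# The shared non-key column b gets the default suffixes .x and .y.
left <- merge(A, B, by = "a", all.x = TRUE)

names(left)  # "a" "b.x" "b.y"
left$b.y     # NA 13 14 NA
```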

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b]

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

It is also worth mentioning that the row order of A stays the same.
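A small sketch of the ordering point, assuming a deliberately unsorted copy of A (the shuffled values and the name merged are illustrative, not from the original answer). merge sorts its result by the join column by default (sort = TRUE), whereas the update join leaves the rows of A where they were:

```r
library(data.table)

# Same rows as the example table A, but in a shuffled order.
A <- data.table(a = c(3L, 1L, 4L, 2L), b = c(14L, 12L, 15L, 13L))
B <- data.table(a = 2:3, b = 13:14)

# merge() returns a new table sorted by 'a' (sort = TRUE is the default).
merged <- merge(A, B, by = "a", all.x = TRUE)

# The update join adds bb in place and keeps A's row order.
A[B, on = "a", bb := i.b]

A$a       # 3 1 4 2
A$bb      # 14 NA NA 13
merged$a  # 1 2 3 4
```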

To see the effect on speed and memory use, let's benchmark with some larger datasets (for data, see the 2nd part of the used data-section below):

library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
           AA[BB, on = .(aa), cc := cc],
           iterations = 1)

which gives (only relevant measurements shown):

> bm[,c(1,3,5)]
# A tibble: 2 x 3
  expression                         median mem_alloc
  <bch:expr>                       <bch:tm> <bch:byt>
1 AA <- BB[AA, on = .(aa)]            4.98s     4.1GB
2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

NOTE: Gains in speed and memory use might differ in different setups.

Used data:

library(data.table)

# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e7, TRUE))
# use the full column name aa instead of relying on partial matching of AA$a
BB <- data.table(aa = sample(AA$aa, 2e5), cc = sample(2:8, 2e5, TRUE))

answered Sep 22 '22 09:09

Jaap
