How to merge two data.table by different column names?

Tags: merge, r, data.table

I have two data.tables, X and Y.

columns in X: area, id, value
columns in Y: ID, price, sales

Create the two data.tables:

    library(data.table)

    X = data.table(area  = c('US', 'UK', 'EU'),
                   id    = c('c001', 'c002', 'c003'),
                   value = c(100, 200, 300))

    Y = data.table(ID    = c('c001', 'c002', 'c003'),
                   price = c(500, 200, 400),
                   sales = c(20, 30, 15))

And I set keys for X and Y:

    setkey(X, id)
    setkey(Y, ID)

Now I try to join X and Y by id in X and ID in Y:

    merge(X, Y)
    merge(X, Y, by=c('id', 'ID'))
    merge(X, Y, by.x='id', by.y='ID')

All three calls raised an error saying that the column names in the by argument are invalid.

I checked the data.table manual and found that its merge function does not support the by.x and by.y arguments.

How could I join two data.tables by different column names without changing the column names?

Update:
I managed to join the two tables with X[Y], but why does the merge function fail for data.tables?


asked Apr 25 '15 13:04

Zelong

2 Answers

As of data.table version 1.9.6 (on CRAN since September 2015), you can specify the by.x and by.y arguments in the data.table method for merge (merge.data.table):

    merge(x=X, y=Y, by.x="id", by.y="ID")[]
    #     id area value price sales
    #1: c001   US   100   500    20
    #2: c002   UK   200   200    30
    #3: c003   EU   300   400    15
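As a minimal sketch of how this behaves when the tables do not line up perfectly, the example below assumes data.table 1.9.6 or later and adds a hypothetical fourth row (id c004, area JP) to X that has no partner in Y. It does not call setkey at all, since by.x and by.y name the join columns directly, and it uses the standard all.x argument of merge to switch from an inner join to a left join.

```r
library(data.table)

# No setkey() calls: by.x / by.y name the join columns directly.
# The c004 / JP row is made up for illustration and has no match in Y.
X <- data.table(area  = c("US", "UK", "EU", "JP"),
                id    = c("c001", "c002", "c003", "c004"),
                value = c(100, 200, 300, 400))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# Default is an inner join: only ids present in both tables are kept.
inner <- merge(X, Y, by.x = "id", by.y = "ID")

# all.x = TRUE keeps every row of X and fills Y's columns with NA.
left <- merge(X, Y, by.x = "id", by.y = "ID", all.x = TRUE)

print(inner)
print(left)
```

The join column in the result takes the name given in by.x (id here), which matches the output shown above.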

However, in data.table 1.9.6 you can also specify the on argument in the X[Y] notation:

X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.

    X[Y, on=c(id = "ID")]
    #   area   id value price sales
    #1:   US c001   100   500    20
    #2:   UK c002   200   200    30
    #3:   EU c003   300   400    15
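To make the direction of the on= mapping concrete, here is a small sketch, again assuming data.table 1.9.6 or later and unkeyed tables. The extra c004 row in Y (price 250, sales 5) is invented for illustration. In X[Y, on = c(id = "ID")] the name on the left of the equals sign belongs to X and the quoted name belongs to Y; swapping the tables means swapping the names as well.

```r
library(data.table)

X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
# The c004 row is hypothetical and has no match in X.
Y <- data.table(ID    = c("c001", "c002", "c003", "c004"),
                price = c(500, 200, 400, 250),
                sales = c(20, 30, 15, 5))

# X[Y, ...] looks up every row of Y in X, so unmatched rows of Y
# are kept and X's columns are NA for them.
xy <- X[Y, on = c(id = "ID")]

# nomatch = 0L drops the unmatched rows, giving an inner join.
xy_inner <- X[Y, on = c(id = "ID"), nomatch = 0L]

# Reversing the join reverses the name mapping: ID is Y's column, id is X's.
yx <- Y[X, on = c(ID = "id")]

print(xy)
print(xy_inner)
print(yx)
```

The result keeps the join column name of the outer table: id for X[Y, ...] and ID for Y[X, ...].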

This answer by the data.table author has more details.


answered Sep 24 '22 13:09

tospig

OUTDATED

Use this operation:

    X[Y]
    #    area   id value price sales
    # 1:   US c001   100   500    20
    # 2:   UK c002   200   200    30
    # 3:   EU c003   300   400    15

or this operation:

    Y[X]
    #      ID price sales area value
    # 1: c001   500    20   US   100
    # 2: c002   200    30   UK   200
    # 3: c003   400    15   EU   300
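The reason these keyed joins work even though the column names differ is, as far as I understand the documentation, that X[Y] matches the key columns of Y against the key columns of X by position rather than by name. A minimal sketch using the tables from the question:

```r
library(data.table)

X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

setkey(X, id)
setkey(Y, ID)

# The keys have different names ...
key_x <- key(X)
key_y <- key(Y)

# ... but the keyed join pairs the first key column of Y with the
# first key column of X, so the names never need to agree.
res <- X[Y]

print(key_x)
print(key_y)
print(res)
```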

Edit: After you edited your question, I read Section 1.12 of the FAQ, "What is the difference between X[Y] and merge(X,Y)?", which led me to check ?merge. I discovered that there are two different merge methods, and which one runs depends on the class of the object you pass in. The default for data frames is merge.data.frame, but data.tables use merge.data.table. Compare

    merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
    # Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
    #   A non-empty vector of column names for `by` is required.

with

    merge.data.frame(X, Y, by.x = "id", by.y = "ID")
    #     id area value price sales
    # 1 c001   US   100   500    20
    # 2 c002   UK   200   200    30
    # 3 c003   EU   300   400    15
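The difference between the two calls comes from S3 dispatch: merge is a generic, and a data.table carries both the data.table and data.frame classes, so the data.table method is tried first. The sketch below is one way to inspect this on your own installation; it only uses base R's getS3method and formals, and the printed result for the data.table method depends on the installed version, so no particular output is claimed for it.

```r
library(data.table)

X <- data.table(id = c("c001", "c002"), value = c(100, 200))

# A data.table is also a data.frame; dispatch uses the first class
# in this vector that has a merge method.
cls <- class(X)

dt_method <- getS3method("merge", "data.table")
df_method <- getS3method("merge", "data.frame")

# Check which methods accept by.x / by.y in the installed versions.
has_by_xy_dt <- all(c("by.x", "by.y") %in% names(formals(dt_method)))
has_by_xy_df <- all(c("by.x", "by.y") %in% names(formals(df_method)))

print(cls)
print(has_by_xy_dt)  # version dependent
print(has_by_xy_df)
```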

Edit for completeness: based on a comment by @Michael Bernsteiner, it looks like the data.table team is planning to implement by.x and by.y in the merge.data.table function, but hasn't done so yet.
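For a data.table version whose merge method lacks by.x and by.y, two workarounds can be sketched. Neither is from the original answer; they are suggestions under the assumption that the original tables must keep their column names. The first goes through plain data frames and converts the result back, and the second renames the column on a copy only (Y_tmp is a hypothetical helper name).

```r
library(data.table)

X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# Option 1: use the data.frame method on plain data frames,
# then convert the result back to a data.table.
res_df <- merge(as.data.frame(X), as.data.frame(Y),
                by.x = "id", by.y = "ID")
res1 <- as.data.table(res_df)

# Option 2: rename the join column on a copy so Y itself is untouched,
# then merge on the now-shared name.
Y_tmp <- copy(Y)
setnames(Y_tmp, "ID", "id")
res2 <- merge(X, Y_tmp, by = "id")

print(res1)
print(res2)
print(names(Y))  # still ID, price, sales
```

Option 1 copies both tables, so it is best suited to small data; option 2 copies only Y.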

answered Sep 23 '22 13:09

Richard Erickson
