I have two data.tables, X and Y.
columns in X: area, id, value
columns in Y: ID, price, sales
Create the two data.tables:
library(data.table)
X = data.table(area=c('US', 'UK', 'EU'), id=c('c001', 'c002', 'c003'), value=c(100, 200, 300))
Y = data.table(ID=c('c001', 'c002', 'c003'), price=c(500, 200, 400), sales=c(20, 30, 15))
And I set keys for X and Y:
setkey(X, id)
setkey(Y, ID)
Now I try to join X and Y by id in X and ID in Y:
merge(X, Y)
merge(X, Y, by=c('id', 'ID'))
merge(X, Y, by.x='id', by.y='ID')
All three calls raised an error saying that the column names in the by argument are invalid.
I referred to the data.table manual and found that its merge function does not support the by.x and by.y arguments.
How could I join two data.tables by different column names without changing the column names?
Append:
I managed to join the two tables with X[Y], but why does the merge function fail in data.table?
As of data.table version 1.9.6 (on CRAN in September 2015), you can specify the by.x and by.y arguments in data.table's merge method:
merge(x=X, y=Y, by.x="id", by.y="ID")[]
#      id area value price sales
# 1: c001   US   100   500    20
# 2: c002   UK   200   200    30
# 3: c003   EU   300   400    15
However, in data.table 1.9.6 you can also specify the on argument in the X[Y] notation:
X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.
X[Y, on=c(id = "ID")]
#    area   id value price sales
# 1:   US c001   100   500    20
# 2:   UK c002   200   200    30
# 3:   EU c003   300   400    15
This answer by the data.table author has more details.
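To make the point about keys concrete, here is a minimal sketch that rebuilds the question's tables and joins them with on= without ever calling setkey(). It assumes data.table 1.9.6 or later is installed; the variable name joined is only illustrative.

```r
library(data.table)

# Same data as in the question, but no setkey() call anywhere
X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# Join column "id" of X with column "ID" of Y
joined <- X[Y, on = c(id = "ID")]
print(joined)
```

The join column keeps the name it has in X (id), and neither table needs to be sorted or keyed beforehand.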
Use this operation:
X[Y]
#    area   id value price sales
# 1:   US c001   100   500    20
# 2:   UK c002   200   200    30
# 3:   EU c003   300   400    15
or this operation:
Y[X]
#      ID price sales area value
# 1: c001   500    20   US   100
# 2: c002   200    30   UK   200
# 3: c003   400    15   EU   300
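With the question's data both forms return the same three rows, so the difference between them is easy to miss. The sketch below uses a hypothetical variant of Y (c003 removed, c004 added, not part of the original question) to show that X[Y] returns one row per row of Y, while Y[X] returns one row per row of X.

```r
library(data.table)

X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
# Hypothetical Y: c003 is missing and c004 has no partner in X
Y <- data.table(ID    = c("c001", "c002", "c004"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))
setkey(X, id)
setkey(Y, ID)

xy <- X[Y]  # rows follow Y; area and value are NA for c004
yx <- Y[X]  # rows follow X; price and sales are NA for c003
print(xy)
print(yx)
```

Which form to use therefore depends on which table's rows should all be kept in the result.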
Edit: after you edited your question, I read Section 1.12 of the FAQ, "What is the difference between X[Y] and merge(X,Y)?", which led me to check out ?merge, and I discovered that there are two different merge functions depending upon which package you are using. The default is merge.data.frame, but data.table uses merge.data.table. Compare
merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
#   A non-empty vector of column names for `by` is required.
with
merge.data.frame(X, Y, by.x = "id", by.y = "ID")
#     id area value price sales
# 1 c001   US   100   500    20
# 2 c002   UK   200   200    30
# 3 c003   EU   300   400    15
Edit for completeness: based upon a comment by @Michael Bernsteiner, it looks like the data.table team is planning on implementing by.x and by.y in the merge.data.table function, but hasn't done so yet.
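For code that may run against either an old or a new data.table, one option is to branch on the installed version. This is only a sketch: the 1.9.6 cutoff is the version the other answer reports as adding by.x and by.y, and the fallback goes through plain data.frames and base merge, which returns a data.frame that is then converted back.

```r
library(data.table)

X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

if (packageVersion("data.table") >= "1.9.6") {
  # merge.data.table accepts by.x / by.y from this version on
  res <- merge(X, Y, by.x = "id", by.y = "ID")
} else {
  # Fallback: base merge on plain data.frames, converted back afterwards
  res <- as.data.table(merge(as.data.frame(X), as.data.frame(Y),
                             by.x = "id", by.y = "ID"))
}
print(res)
```

In both branches the merged key column is named id and comes first in the result.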