
How to merge two data.tables by different column names?

I have two data.tables, X and Y.

columns in X: area, id, value
columns in Y: ID, price, sales

Create the two data.tables:

    X = data.table(area=c('US', 'UK', 'EU'),
                   id=c('c001', 'c002', 'c003'),
                   value=c(100, 200, 300)
                  )

    Y = data.table(ID=c('c001', 'c002', 'c003'),
                   price=c(500, 200, 400),
                   sales=c(20, 30, 15)
                  )

And I set keys for X and Y:

    setkey(X, id)
    setkey(Y, ID)

Now I try to join X and Y by id in X and ID in Y:

    merge(X, Y)
    merge(X, Y, by=c('id', 'ID'))
    merge(X, Y, by.x='id', by.y='ID')

All three calls raised an error saying that the column names in the by argument are invalid.

I referred to the data.table manual and found that its merge function does not support the by.x and by.y arguments.

How could I join two data.tables by different column names without changing the column names?

Addendum:
I managed to join the two tables with X[Y], but why does the merge function fail for data.tables?

asked Apr 25 '15 13:04 by Zelong



2 Answers

As of data.table version 1.9.6 (released on CRAN in September 2015), you can specify the by.x and by.y arguments when calling merge on data.tables:

    merge(x=X, y=Y, by.x="id", by.y="ID")[]
    #     id area value price sales
    #1: c001   US   100   500    20
    #2: c002   UK   200   200    30
    #3: c003   EU   300   400    15

In data.table 1.9.6 you can also specify the on argument in the X[Y] notation:

X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.

    X[Y, on=c(id = "ID")]
    #   area   id value price sales
    #1:   US c001   100   500    20
    #2:   UK c002   200   200    30
    #3:   EU c003   300   400    15
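As a minimal sketch of the two approaches side by side, assuming data.table 1.9.6 or later is installed, the following rebuilds the question's tables without calling setkey() and checks that both forms return the same values. The variable names merged and joined are illustrative only.

```r
library(data.table)

# Same tables as in the question; no keys are set.
X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# merge() with differently named join columns (needs data.table >= 1.9.6).
merged <- merge(X, Y, by.x = "id", by.y = "ID")

# Equivalent X[Y] join using on=, mapping X's "id" to Y's "ID".
joined <- X[Y, on = c(id = "ID")]

# Both results hold the same columns, only in a different order:
# merge() puts the join column first, X[Y] keeps X's column order.
names(merged)
names(joined)
```

Neither call modifies X or Y, so the original column names stay unchanged, which is what the question asked for.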

This answer by the data.table author has more details.

answered Sep 24 '22 13:09 by tospig


OUTDATED


Use this operation:

    X[Y]
    #    area   id value price sales
    # 1:   US c001   100   500    20
    # 2:   UK c002   200   200    30
    # 3:   EU c003   300   400    15

or this operation:

    Y[X]
    #      ID price sales area value
    # 1: c001   500    20   US   100
    # 2: c002   200    30   UK   200
    # 3: c003   400    15   EU   300

Edit: after you edited your question, I read Section 1.12 of the FAQ, "What is the difference between X[Y] and merge(X,Y)?", which led me to check out ?merge. I discovered that there are two different merge functions, depending on which package you are using. The default is merge.data.frame, but data.table uses merge.data.table. Compare

    merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
    # Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
    # A non-empty vector of column names for `by` is required.

with

    merge.data.frame(X, Y, by.x = "id", by.y = "ID")
    #     id area value price sales
    # 1 c001   US   100   500    20
    # 2 c002   UK   200   200    30
    # 3 c003   EU   300   400    15

Edit for completeness: based upon a comment by @Michael Bernsteiner, it looks like the data.table team is planning to implement by.x and by.y in the merge.data.table function, but hasn't done so yet.
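The FAQ entry mentioned above concerns how X[Y] and merge(X, Y) differ. One practical difference can be sketched as follows, assuming data.table 1.9.6 or later so that on= and by.x/by.y are available: X[Y] keeps every row of Y, while merge() by default keeps only matching rows. The table Y2 and the ID 'c004' are made up for illustration and are not part of the question's data.

```r
library(data.table)

X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))

# Hypothetical second table: 'c004' has no counterpart in X.
Y2 <- data.table(ID    = c("c001", "c004"),
                 price = c(500, 900),
                 sales = c(20, 5))

# X[Y2] returns one row per row of Y2; unmatched rows get NA in X's columns.
right_join <- X[Y2, on = c(id = "ID")]

# merge() defaults to all = FALSE, so the unmatched row is dropped.
inner_merge <- merge(X, Y2, by.x = "id", by.y = "ID")

# nomatch = 0L makes X[Y2] drop unmatched rows as well.
inner_join <- X[Y2, on = c(id = "ID"), nomatch = 0L]

# all.y = TRUE makes merge() keep every row of Y2.
right_merge <- merge(X, Y2, by.x = "id", by.y = "ID", all.y = TRUE)
```

Here right_join and right_merge each have two rows, while inner_merge and inner_join each have one.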

answered Sep 23 '22 13:09 by Richard Erickson