I have two datasets that I would like to join using R -
Dataset 1
ID Name Date Price
1 A 2011 $100
2 B 2012 $200
3 C 2013 $300
Dataset 2
ID Date Price
1 2012 $100
1 2013 $200
3 2014 $300
Using left-join()
in dplyr
by ID I'd end up with this
ID Name Date.x Price.x Date.y Price.y
1 A 2011 $100 2012 $100
1 A 2011 $100 2013 $200
2 B 2012 $200
3 C 2013 $300 2014 $300
What I would however like to have as a final product is this
ID Name Date Price
1 A 2011 $100
1 A 2012 $100
1 A 2013 $200
2 B 2012 $200
3 C 2013 $300
3 C 2014 $300
i.e instead of merging to the existing row, I'd like to create a new row when a match is found and duplicate the existing information that won't change (ID and Name) and alter the Date and Price column where necessary. Any ideas as to an efficient way to do this on a large dataset?
To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
To append data frames in R, use the rbind() function. The rbind() is a built-in R function that can combine several vectors, matrices, and/or data frames by rows. To join two data frames (datasets) vertically, use the rbind() function.
How do I add a column to a DataFrame in R? To add a new column to a dataframe in R you can use the $-operator. For example, to add the column “NewColumn”, you can do like this: dataf$NewColumn <- Values . Now, this will effectively add your new variable to your dataset.
You asked about the efficient way, so I'll introduce data.table:
library(data.table)
setDT(DF1)
setDT(DF2)
# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]
# stack data
DT = rbind(DF1, DF2)
# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]
which gives
ID Date Price Name
1: 1 2011 $100 A
2: 2 2012 $200 B
3: 3 2013 $300 C
4: 1 2012 $100 A
5: 1 2013 $200 A
6: 3 2014 $300 C
rbind
for data.tables is pretty fast. I wouldn't really expect efficiency to be a big issue when just binding two tables, though.
Regarding spinning off the ID attribute, Name, it matches the recommendations of the dplyr package author, who refers to it as making data tidy.
This is a slight variation of @Frank's answer. The main issue is that your 2nd table doesn't have a Name
column. This can be obtained quite efficiently using data.table's update while join approach..
require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back
Now that there's a Name
column, we can simply rbind
the result.
ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, by="Name") else dt2)
If necessary, reorder the result by reference using setorder()
:
setorder(ans, ID, Name) # by reference, no need to assign the result back
# ID Name Date Price
# 1: 1 A 2011 $100
# 2: 1 A 2012 $100
# 3: 1 A 2013 $200
# 4: 2 B 2012 $200
# 5: 3 C 2013 $300
# 6: 3 C 2014 $300
:=
operator and set*
functions in data.table modify the input object by reference.
dt1 = fread('ID Name Date Price
1 A 2011 $100
2 B 2012 $200
3 C 2013 $300')
dt2 = fread('ID Date Price
1 2012 $100
1 2013 $200
3 2014 $300')
df1 <- data.frame(
ID=1:3,
Name=c("A","B","C"),
Date=c(2011,2012,2013),
Price=c(100,200,300)
)
df2 <- data.frame(
ID=c(1,1,3),
Date=c(2012,2013,2014),
Price=c(100,200,300)
)
left_join
won't get you that desired output. You can use full_join
.
merged <- full_join(df1, df2, by=c("Date","ID"))
Here's a way to get to the output you want with melt
from the reshape2
package:
library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"))
Then:
> merged[na.omit(merged$Name), -4] #remove NAs and column from melt
ID Name Date value
1 1 A 2011 100
2 2 B 2012 200
3 3 C 2013 300
1.1 1 A 2011 100
2.1 2 B 2012 200
3.1 3 C 2013 300
Perhaps one of the efficient ways to do that is to use two steps merge.
# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)
# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)
Assign missing "Name" values to Dataset 2 by using merge
function in {base} package
dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)
Merge datasets
merge(dataset1, dataset2, all = T)
Which gives:
ID Name Date Price
1 1 A 2011 $100
2 1 A 2012 $100
3 1 A 2013 $200
4 2 B 2012 $200
5 3 C 2013 $300
6 3 C 2014 $300
Inner join with nomatch = 0
. For example, if all ID in dataset2 is 4, inner join will not spit NA to non-matching IDs. If you remove nomatch = 0
, then NA
s will be produced.
EDIT: added rbindlist wrapper as per @Arun's suggestion
library("data.table")
rbindlist(list(df1,
setDT(df1)[i = df2,
j = .(ID, Name, Date = i.Date, Price = i.Price),
on = .(ID),
nomatch = 0]))
Output:
ID Name Date Price
1: 1 A 2011 $100
2: 2 B 2012 $200
3: 3 C 2013 $300
4: 1 A 2012 $100
5: 1 A 2013 $200
6: 3 C 2014 $300
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With