Joining 2 datasets and creating new rows where matches are found

Tags: r, dplyr

I have two datasets that I would like to join using R.

Dataset 1

    ID Name Date Price
    1    A   2011 $100
    2    B   2012 $200
    3    C   2013 $300

Dataset 2

    ID Date Price
    1  2012 $100
    1  2013 $200
    3  2014 $300

Using left_join() in dplyr by ID, I'd end up with this:

    ID Name Date.x Price.x Date.y Price.y
    1   A   2011    $100   2012   $100
    1   A   2011    $100   2013   $200
    2   B   2012    $200
    3   C   2013    $300   2014   $300

What I would like as a final product, however, is this:

    ID Name Date Price
    1  A     2011 $100
    1  A     2012 $100
    1  A     2013 $200
    2  B     2012 $200
    3  C     2013 $300
    3  C     2014 $300

That is, instead of merging into the existing row, I'd like to create a new row whenever a match is found, duplicating the information that doesn't change (ID and Name) and taking Date and Price from the matching row. Any ideas for an efficient way to do this on a large dataset?

asked Aug 03 '16 18:08

Allan Davids

5 Answers

You asked about the efficient way, so I'll introduce data.table:

library(data.table)
setDT(DF1)
setDT(DF2)

# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]

# stack data
DT = rbind(DF1, DF2)

# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]

which gives

   ID Date Price Name
1:  1 2011  $100    A
2:  2 2012  $200    B
3:  3 2013  $300    C
4:  1 2012  $100    A
5:  1 2013  $200    A
6:  3 2014  $300    C

rbind for data.tables is pretty fast. I wouldn't really expect efficiency to be a big issue when just binding two tables, though.

Spinning off the ID attribute (Name) into its own table matches the recommendations of the dplyr package author, who refers to this as making data tidy.
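For readers without data.table, the same idea (keep the ID attributes in a lookup table, stack the observations, then look the names up again) can be sketched in base R. This only illustrates the approach described above and is not Frank's code; the names id_table and stacked are made up for the example, and DF1 and DF2 are rebuilt from the question's data.

```r
# Rebuild the question's two datasets (values copied from the question)
DF1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                  Date = 2011:2013, Price = c("$100", "$200", "$300"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(ID = c(1L, 1L, 3L),
                  Date = 2012:2014, Price = c("$100", "$200", "$300"),
                  stringsAsFactors = FALSE)

# ID attributes live in their own lookup table (one row per ID)
id_table <- unique(DF1[, c("ID", "Name")])

# Stack the observation columns shared by both datasets
stacked <- rbind(DF1[, c("ID", "Date", "Price")], DF2)

# Look the Name up by ID only when it is needed
stacked$Name <- id_table$Name[match(stacked$ID, id_table$ID)]

# Optional: sort and reorder columns to match the layout asked for
stacked <- stacked[order(stacked$ID, stacked$Date),
                   c("ID", "Name", "Date", "Price")]
print(stacked)
```

match() returns NA for an ID that is missing from the lookup table, so such rows would end up with Name set to NA rather than being dropped.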

answered Oct 07 '22 03:10

Frank

This is a slight variation of @Frank's answer. The main issue is that your second table doesn't have a Name column. It can be added quite efficiently using data.table's update-while-join approach:

require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back

Now that there's a Name column, we can simply rbind the result.

# drop dt2 rows whose ID had no match in dt1 (Name is NA); na.omit's column argument is `cols`
ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, cols="Name") else dt2)

If necessary, reorder the result by reference using setorder():

setorder(ans, ID, Name) # by reference, no need to assign the result back
#    ID Name Date Price
# 1:  1    A 2011  $100
# 2:  1    A 2012  $100
# 3:  1    A 2013  $200
# 4:  2    B 2012  $200
# 5:  3    C 2013  $300
# 6:  3    C 2014  $300

The := operator and the set* functions in data.table modify the input object by reference.
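A small sketch of what "by reference" means in practice. The table dt_demo and its columns are made up for illustration. The point is that := changes the object in place, so a plain assignment (alias <- dt_demo) sees the change, while copy() gives an independent table.

```r
library(data.table)

dt_demo <- data.table(ID = 1:3, Price = c(100, 200, 300))

alias    <- dt_demo        # same underlying object, not a copy
snapshot <- copy(dt_demo)  # independent deep copy

# := adds the column in place; nothing is assigned back
dt_demo[, Flag := Price > 150]

names(alias)     # includes "Flag", because alias points at the same table
names(snapshot)  # still only "ID" and "Price"

# setorder() likewise reorders the rows of dt_demo itself
setorder(dt_demo, -Price)
dt_demo$ID
```

When the original table has to stay untouched, as with dt2 before the update join, work on copy(dt2) instead.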

dt1 = fread('ID Name   Date Price
              1    A   2011  $100
              2    B   2012  $200
              3    C   2013  $300')

dt2 = fread('ID  Date Price
              1  2012  $100
              1  2013  $200
              3  2014  $300')

answered Oct 07 '22 04:10

Arun

df1 <- data.frame(
  ID=1:3,
  Name=c("A","B","C"),
  Date=c(2011,2012,2013),
  Price=c(100,200,300)
)

df2 <- data.frame(
  ID=c(1,1,3),
  Date=c(2012,2013,2014),
  Price=c(100,200,300)
)

left_join won't get you the desired output. You can use full_join:

library(dplyr)
merged <- full_join(df1, df2, by=c("Date","ID"))

Here's a way to get to the output you want with melt from the reshape2 package:

library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"))

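Then (note that the printed result repeats the three rows of df1 and contains none of the rows from df2, so it is not yet the desired output):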

> merged[na.omit(merged$Name), -4] #remove NAs and column from melt
    ID Name Date value
1    1    A 2011   100
2    2    B 2012   200
3    3    C 2013   300
1.1  1    A 2011   100
2.1  2    B 2012   200
3.1  3    C 2013   300
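One way to reach the six rows asked for with dplyr alone is to attach Name to the second table first and then stack the two tables. This is a minimal sketch rather than Warner's method; it assumes integer IDs and numeric prices as in df1 and df2 above, and names_by_id, df2_named and result are illustrative names.

```r
library(dplyr)

df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                  Date = c(2011, 2012, 2013), Price = c(100, 200, 300),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(1L, 1L, 3L),
                  Date = c(2012, 2013, 2014), Price = c(100, 200, 300))

# One Name per ID, taken from df1
names_by_id <- distinct(df1, ID, Name)

# Attach the Name to each df2 row, then stack both tables and sort
df2_named <- left_join(df2, names_by_id, by = "ID")
result <- arrange(bind_rows(df1, df2_named), ID, Date)
result
```

bind_rows() matches columns by name, so the different column order of df2_named does not matter. An ID that appears only in df2 would get Name = NA here; replacing left_join() with inner_join() drops such rows instead.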

answered Oct 07 '22 05:10

Warner

Perhaps one efficient way to do that is a two-step merge.

# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)

# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)

Assign the missing "Name" values to Dataset 2 using the merge function from the {base} package:

dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)

Merge the datasets:

merge(dataset1, dataset2, all = TRUE)

Which gives:

   ID Name Date Price
1  1    A 2011  $100
2  1    A 2012  $100
3  1    A 2013  $200
4  2    B 2012  $200
5  3    C 2013  $300
6  3    C 2014  $300
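The second merge works without a by argument because merge() joins on all columns the two data frames share. After the first step that is every column, so all = TRUE effectively returns the union of the rows, sorted by those columns. A sketch that makes this explicit (d1, d2, shared_cols and combined are illustrative names; the data is copied from the question):

```r
d1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                 Date = 2011:2013, Price = c("$100", "$200", "$300"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c(1L, 1L, 3L),
                 Date = 2012:2014, Price = c("$100", "$200", "$300"),
                 stringsAsFactors = FALSE)

# Step 1: inner join on ID (the only shared column) adds Name to d2
d2 <- merge(d1[c("ID", "Name")], d2)

# Step 2: both tables now share every column, so a full outer join
# on all of them keeps every distinct row from either table
shared_cols <- intersect(names(d1), names(d2))
combined <- merge(d1, d2, by = shared_cols, all = TRUE)
combined
```

Unlike rbind(), this collapses a row that appears identically in both tables into a single row, because such a row counts as a match.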

answered Oct 07 '22 03:10

aleksapaulius

Use an inner join with nomatch = 0. For example, if every ID in dataset2 were 4 (a value that does not occur in dataset1), the inner join would not produce NA rows for the non-matching IDs. If you remove nomatch = 0, NAs will be produced.

EDIT: added rbindlist wrapper as per @Arun's suggestion

library("data.table")
rbindlist(list(df1, 
               setDT(df1)[i = df2, 
                          j = .(ID, Name, Date = i.Date, Price = i.Price),
                          on = .(ID), 
                          nomatch = 0]))

Output:

   ID Name Date Price
1:  1    A 2011  $100
2:  2    B 2012  $200
3:  3    C 2013  $300
4:  1    A 2012  $100
5:  1    A 2013  $200
6:  3    C 2014  $300
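The effect of nomatch is easier to see on a table that actually contains a non-matching ID. A minimal sketch with made-up tables (left, right, inner and with_na are illustrative names, not part of the answer above):

```r
library(data.table)

left  <- data.table(ID = 1:3, Name = c("A", "B", "C"))
right <- data.table(ID = c(1L, 4L), Date = c(2012L, 2015L))  # ID 4 has no match in left

# Inner join: rows of right without a match in left are dropped
inner <- left[right, on = "ID", nomatch = 0]

# Default (nomatch = NA): unmatched rows are kept, with NA in left's columns
with_na <- left[right, on = "ID"]

inner    # one row, for ID 1
with_na  # two rows; the row for ID 4 has Name = NA
```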

answered Oct 07 '22 04:10

Sathish
