Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Joining 2 datasets and creating new rows where matches are found

Tags:

r

dplyr

I have two datasets that I would like to join using R -

Dataset 1

    ID Name Date Price
    1    A   2011 $100
    2    B   2012 $200
    3    C   2013 $300

Dataset 2

    ID Date Price
    1  2012 $100
    1  2013 $200
    3  2014 $300

Using left-join() in dplyr by ID I'd end up with this

    ID Name Date.x Price.x Date.y Price.y
    1   A   2011    $100   2012   $100
    1   A   2011    $100   2013   $200
    2   B   2012    $200
    3   C   2013    $300   2014   $300

What I would however like to have as a final product is this

    ID Name Date Price
    1  A     2011 $100
    1  A     2012 $100
    1  A     2013 $200
    2  B     2012 $200
    3  C     2013 $300
    3  C     2014 $300

i.e instead of merging to the existing row, I'd like to create a new row when a match is found and duplicate the existing information that won't change (ID and Name) and alter the Date and Price column where necessary. Any ideas as to an efficient way to do this on a large dataset?

like image 391
Allan Davids Avatar asked Aug 03 '16 18:08

Allan Davids


People also ask

How do you combine data sets?

To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

How do I concatenate datasets in R?

To append data frames in R, use the rbind() function. The rbind() is a built-in R function that can combine several vectors, matrices, and/or data frames by rows. To join two data frames (datasets) vertically, use the rbind() function.

How do I add a column from one dataset to another in R?

How do I add a column to a DataFrame in R? To add a new column to a dataframe in R you can use the $-operator. For example, to add the column “NewColumn”, you can do like this: dataf$NewColumn <- Values . Now, this will effectively add your new variable to your dataset.


5 Answers

You asked about the efficient way, so I'll introduce data.table:

library(data.table)
setDT(DF1)
setDT(DF2)

# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]

# stack data
DT = rbind(DF1, DF2)

# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]

which gives

   ID Date Price Name
1:  1 2011  $100    A
2:  2 2012  $200    B
3:  3 2013  $300    C
4:  1 2012  $100    A
5:  1 2013  $200    A
6:  3 2014  $300    C

rbind for data.tables is pretty fast. I wouldn't really expect efficiency to be a big issue when just binding two tables, though.

Regarding spinning off the ID attribute, Name, it matches the recommendations of the dplyr package author, who refers to it as making data tidy.

like image 95
Frank Avatar answered Oct 07 '22 03:10

Frank


This is a slight variation of @Frank's answer. The main issue is that your 2nd table doesn't have a Name column. This can be obtained quite efficiently using data.table's update while join approach..

require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back

Now that there's a Name column, we can simply rbind the result.

ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, by="Name") else dt2)

If necessary, reorder the result by reference using setorder():

setorder(ans, ID, Name) # by reference, no need to assign the result back
#    ID Name Date Price
# 1:  1    A 2011  $100
# 2:  1    A 2012  $100
# 3:  1    A 2013  $200
# 4:  2    B 2012  $200
# 5:  3    C 2013  $300
# 6:  3    C 2014  $300

:= operator and set* functions in data.table modify the input object by reference.


dt1 = fread('ID Name   Date Price
              1    A   2011  $100
              2    B   2012  $200
              3    C   2013  $300')

dt2 = fread('ID  Date Price
              1  2012  $100
              1  2013  $200
              3  2014  $300')
like image 30
Arun Avatar answered Oct 07 '22 04:10

Arun


df1 <- data.frame(
  ID=1:3,
  Name=c("A","B","C"),
  Date=c(2011,2012,2013),
  Price=c(100,200,300)
)

df2 <- data.frame(
  ID=c(1,1,3),
  Date=c(2012,2013,2014),
  Price=c(100,200,300)
)

left_join won't get you that desired output. You can use full_join.

merged <- full_join(df1, df2, by=c("Date","ID"))

Here's a way to get to the output you want with melt from the reshape2 package:

library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"))

Then:

> merged[na.omit(merged$Name), -4] #remove NAs and column from melt
    ID Name Date value
1    1    A 2011   100
2    2    B 2012   200
3    3    C 2013   300
1.1  1    A 2011   100
2.1  2    B 2012   200
3.1  3    C 2013   300
like image 22
Warner Avatar answered Oct 07 '22 05:10

Warner


Perhaps one of the efficient ways to do that is to use two steps merge.

# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)

# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)

Assign missing "Name" values to Dataset 2 by using merge function in {base} package

dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)

Merge datasets

merge(dataset1, dataset2, all = T)

Which gives:

   ID Name Date Price
1  1    A 2011  $100
2  1    A 2012  $100
3  1    A 2013  $200
4  2    B 2012  $200
5  3    C 2013  $300
6  3    C 2014  $300
like image 33
aleksapaulius Avatar answered Oct 07 '22 03:10

aleksapaulius


Inner join with nomatch = 0. For example, if all ID in dataset2 is 4, inner join will not spit NA to non-matching IDs. If you remove nomatch = 0, then NAs will be produced.

EDIT: added rbindlist wrapper as per @Arun's suggestion

library("data.table")
rbindlist(list(df1, 
               setDT(df1)[i = df2, 
                          j = .(ID, Name, Date = i.Date, Price = i.Price),
                          on = .(ID), 
                          nomatch = 0]))

Output:

   ID Name Date Price
1:  1    A 2011  $100
2:  2    B 2012  $200
3:  3    C 2013  $300
4:  1    A 2012  $100
5:  1    A 2013  $200
6:  3    C 2014  $300
like image 42
Sathish Avatar answered Oct 07 '22 04:10

Sathish