Merge 2 dataframes by matching dates

Question

I have two dataframes:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

and:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

The merged data should look like this:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1

My code:

merge(x = df1, y = df2, by = 'id', all.x = TRUE)

I am using merge, but because both dataframes are very large (more than 900K rows each), it takes too long to process. Is there a faster alternative to the merge function, maybe in dplyr?

Jaap · Accepted Answer

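Instead of using merge with data.table, you can also join directly with the X[Y, on = ...] syntax:

library(data.table)

# convert both data.frames to data.tables by reference
setDT(df1)
setDT(df2)

# for each row of df1, look up the matching row in df2 (a left join on df1)
df2[df1, on = c('id','dates')]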

this gives:

> df2[df1, on = c('id','dates')]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1
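A likely reason the original call is slow is that it merges on 'id' only, so every row of df1 is paired with every row of df2 that shares its id. The following sketch rebuilds the example data from the question in base R and compares the row counts; it is an illustration of that effect under this assumption, not a measurement on the real 900K-row data.

```r
# Rebuild the two example data.frames from the question
df1 <- data.frame(
  id    = c(rep("MUM-1", 3), rep("MUM-2", 4)),
  dates = as.Date(c("2015-07-10", "2015-07-11", "2015-07-12",
                    "2014-01-14", "2014-01-15", "2014-01-16", "2014-01-17"))
)
df2 <- data.frame(
  id     = c("MUM-1", "MUM-1", "MUM-2", "MUM-2"),
  dates  = as.Date(c("2015-07-10", "2015-07-12", "2014-01-14", "2014-01-17")),
  field1 = c(1, 2, 4, 0),
  field2 = c(0, 1, 3, 1)
)

# Merging on 'id' alone pairs all rows within each id:
# MUM-1 gives 3 x 2 rows, MUM-2 gives 4 x 2 rows
by_id <- merge(df1, df2, by = "id", all.x = TRUE)
nrow(by_id)    # 14

# Merging on both keys keeps one row per row of df1
by_both <- merge(df1, df2, by = c("id", "dates"), all.x = TRUE)
nrow(by_both)  # 7
```

With many rows per id, the first form grows roughly with the product of the group sizes, which is why matching on both 'id' and 'dates' matters regardless of which package performs the join.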

Doing this with dplyr:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=1460, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=1460, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

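As the benchmark below shows, @Arun's approach is marginally faster than the plain join, dplyr takes roughly 2.5 times as long, and the merge on id alone is far slower: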

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867
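The jaap and arun entries differ in more than speed: the join returns a new data.table, while the := form adds the columns to dt1 in place. The sketch below shows that difference on a tiny made-up subset of the example data. Working on a copy() and using the i. prefix to refer to dt2's columns are choices made here for illustration, not part of the benchmark above.

```r
library(data.table)

# Small illustrative tables (a subset of the question's example data)
dt1 <- data.table(
  id    = c("MUM-1", "MUM-1", "MUM-1", "MUM-2"),
  dates = as.Date(c("2015-07-10", "2015-07-11", "2015-07-12", "2014-01-14"))
)
dt2 <- data.table(
  id     = c("MUM-1", "MUM-2"),
  dates  = as.Date(c("2015-07-10", "2014-01-14")),
  field1 = c(1, 4),
  field2 = c(0, 3)
)

# Join: builds and returns a new data.table, dt1 is left untouched
joined <- dt2[dt1, on = c("id", "dates")]

# Update join: adds the columns by reference to the table it is called on.
# copy() keeps the original dt1 unchanged for comparison.
updated <- copy(dt1)
updated[dt2, c("field1", "field2") := .(i.field1, i.field2), on = c("id", "dates")]

ncol(dt1)      # 2
ncol(updated)  # 4
```

Because := modifies its table in place, calling it directly on dt1 (as in the benchmark) leaves dt1 with the extra columns afterwards, which is worth keeping in mind when the original table is needed again.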

For a comparison on a large dataset, see the answer of @Arun.

Tags: merge, dataframe, r, dplyr

Dheeraj Singh
