R – How to join two data frames by nearest time-date?

Tags:

I have 2 data sets, each containing a date-time value in POSIXlt format, and some other numeric and character variables.

I want to combine both data sets based on the date-time column. But the date stamps of both data sets do not match, so I need to combine them by nearest date (before or after). In my example, data value "e" from 2016-03-01 23:52:00 needs to be combined with "binH" at 2016-03-02 00:00:00, not "binG".

Is there a function that would allow me to combine my data sets by nearest date-time value, even if it is after?

I have found ways of combining dates to the next previous date using the cut() function, or the roll=Inf function in data.tables. But I couldn't get my timestamps into any format roll='nearest' would accept.

    >df1
    date1 value
    1 2016-03-01 17:52:00     a
    2 2016-03-01 18:01:30     b
    3 2016-03-01 18:05:00     c
    4 2016-03-01 20:42:30     d
    5 2016-03-01 23:52:00     e

    >df2
    date2 bin_name
    1 2016-03-01 17:00:00     binA
    2 2016-03-01 18:00:00     binB
    3 2016-03-01 19:00:00     binC
    4 2016-03-01 20:00:00     binD
    5 2016-03-01 21:00:00     binE
    6 2016-03-01 22:00:00     binF
    7 2016-03-01 23:00:00     binG
    8 2016-03-02 00:00:00     binH
    9 2016-03-02 01:00:00     binI

706

asked Sep 02 '16 00:09

user2223405

2 Answers

data.table should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you're setting the key column before using roll.

(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):

new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )

                  date data.new
1: 2016-03-02 12:20:00        t
2: 2016-03-07 12:20:00        u

old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )


                  date data.old
1: 2016-03-02 12:20:00        a
2: 2016-03-07 12:20:00        b

setkey( new, date )
setkey( old, date )

combined <- new[ old, roll = "nearest" ]
combined

                  date data.new data.old
1: 2015-03-02 12:20:00        t        d
2: 2016-03-02 12:20:00        t        a
3: 2016-03-07 12:20:00        u        b
4: 2016-04-02 12:20:00        v        c

I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:

combined <- old[ new, roll = "nearest" ]
combined

                  date data.old data.new
1: 2016-03-02 12:20:00        a        t
2: 2016-03-07 12:20:00        b        u
3: 2016-04-02 12:20:00        c        v

answered Oct 14 '22 08:10

rosscova

I had a similar problem, but instead of using data.table or tidyverse I created my own function amerge for "approximate merge". It takes 4 arguments:

two data frames,
a vector of column names for "firm" (not approximate) merge - these must exist in both data frames,
and the name of a single column (in both data frames) for approximate merge. It will work for any numeric values, including dates.

The idea was to merge rows 1-to-1 of best matches, and not loose any rows from any data frame. Here is my commented code with a working example.

amerge <- function(d1, d2, firm=NULL, approx=NULL) {
  rt = Sys.time()

  # Take care of conflicting column names
  n2 = data.frame(oldname = names(d2), newname = names(d2))
  n2$newname = as.character(n2$newname)
  n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)] =
    paste(n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)], "2", sep=".")

  # Add unique row IDs
  if (length(c(firm, approx))>1) {
    d1$ID1 = factor(apply(d1[,c(approx,firm)], 1, paste, collapse=" "))
    d2$ID2 = factor(apply(d2[,c(approx,firm)], 1, paste, collapse=" "))
  } else {
    d1$ID1 = factor(d1[,c(approx,firm)])
    d2$ID2 = factor(d2[,c(approx,firm)])
  }

  # Perform initial merge on the 'firm' parameters, if any
  # Otherwise match all to all
  if (length(firm)>0) {
    t1 = merge(d1, d2, by=firm, all=T, suff=c("",".2"))
  } else {
    names(d2)= c(n2$newname,"ID2")
    t1 = data.frame()
    for (i1 in 1:nrow(d1)) {
      trow = d1[i1,]
      t1 = rbind(t1, cbind(trow, d2))
    }
  }

  # Match by the most approximate record
  if (length(approx)==1) {
    # Calculate the differential for approximate merging
    t1$DIFF = abs(t1[,approx] - t1[,n2$newname[n2$oldname==approx]])
    # Sort data by ascending DIFF, so that best matching records are used first
    t1 = t1[order(t1$DIFF, t1$ID1, t1$ID2),]
    t2 = data.frame()
    d2$used = 0
    # For each record of d1, find match from d2
    for (i1 in na.omit(unique(t1$ID1))) {
      tx = t1[!is.na(t1$DIFF) & t1$ID1==i1,]
      # If there are non-missing records, get the one with minimum DIFF (top one)
      if (nrow(tx)>0) {
        tx = tx[1,]
        # If matching record found, remove it from the pool, so it's not used again
        t1[!is.na(t1$ID2) & t1$ID2==tx$ID2, c(n2$newname[!(n2$newname %in% firm)], "DIFF")] = NA
        # And mark it as used
        d2$used[d2$ID2==tx$ID2] = 1
      } else {
        # If there are no non-missing records, just get the first one from the top
        tx = t1[!is.na(t1$ID1) & t1$ID1==i1,][1,]
      }
      t2 = rbind(t2,tx)
    }
  } else {
    t2 = t1
  }
  # Make the records the same order as d1
  t2 = t2[match(d1$ID1, t2$ID1),]
  # Add unmatched records from d2 to the end of output
  if (any(d2$used==0)) {
    tx = t1[t1$ID2 %in% d2$ID2[d2$used==0], ]
    tx = tx[!duplicated(tx$ID2),]
    tx[, names(d1)[!(names(d1) %in% c(firm))]] = NA
    t2 = rbind(t2,tx)
    t2[is.na(t2[,approx]), approx] = t2[is.na(t2[,approx]), n2$newname[n2$oldname==approx]]
  }
  t2$DIFF = t2$ID1 = t2$ID2 = NULL
  cat("* Run time: ", round(difftime(Sys.time(),rt, "secs"),1), " seconds.\n", sep="")
  return(t2)
}

And the example:

new <- data.frame(ID=c(1,1,1,2), date = as.POSIXct( c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-04-12 11:03:00")), new = c("t","u","v","x"))
old <- data.frame(ID=c(1,1,1,1,1), date = as.POSIXct( c("2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-03-01 10:09:00", "2015-04-12 10:09:00","2016-03-03 12:20:00")), old = c("a","b","c","d","e"))

amerge(old, new, firm="ID", approx="date")

It outputs:

   ID                date  old              date.2  new
2   1 2016-03-07 12:20:00    a 2016-03-07 12:20:00    u
6   1 2016-04-02 12:20:00    b 2016-04-02 12:20:00    v
7   1 2016-03-01 10:09:00    c                <NA> <NA>
10  1 2015-04-12 10:09:00    d                <NA> <NA>
13  1 2016-03-03 12:20:00    e 2016-03-02 12:20:00    t
16  2 2016-04-12 11:03:00 <NA> 2016-04-12 11:03:00    x

So works for my purpose as intended - there is exactly one copy of each row from both data frames - matched by shortest time difference. One note: the function copies date.2 into date column where the date would be missing.

answered Oct 14 '22 08:10

Michal J Figurski

Related questions
                            
                                How to get screen resolution from JavaScript in R Shiny?
                            
                                ggplot2, facet wrap, fixed y scale for each row, free scale between rows
                            
                                Visualize Parse Tree Structure
                            
                                How display length of branches in phylogenetic tree
                            
                                vim-rmarkdown plugin configuration
                            
                                A caterpillar plot of just the "significant" random effects from a mixed effects model
                            
                                R - Split by "\n" or three spaces and retain at least one space when there are three spaces
                            
                                Fastest way for doing 21 day rolling sum for an ActivityType
                            
                                Aggregating all unique values of each column of data frame
                            
                                How to merge multiple data.frames and sum and average columns at the same time in R
                            
                                ggplot: line plot for discrete x-axis
                            
                                R foreach: from single-machine to cluster
                            
                                Identify a weblink in bold in R
                            
                                Change values in data frame in a specific row using dplyr
                            
                                Delete Redundant columns in R [duplicate]
                            
                                glmnet: How do I know which factor level of my response is coded as 1 in logistic regression
                            
                                Put quotation marks around each element of a vector, and separate with comma
                            
                                How does the PACKAGE argument to .Call work?
                            
                                R - sample used in %in% modify dataframe which is being subsetted
                            
                                Purrr-Fection: In Search of An Elegant Solution to Conditional Data Frame Operations Leveraging Purrr

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

R – How to join two data frames by nearest time-date?

Tags:

datetime

r

posixlt

user2223405

People also ask

2 Answers

rosscova

Michal J Figurski

Recent Activity

Donate For Us