Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge dataframes on matching A, B and *closest* C?

Tags:

r

I have two dataframes like so:

set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=F)

I want to merge to.merge into df (with all.x=T) such that:

  • df$x == to.merge$x AND
  • df$y == to.merge$y AND
  • abs(df$time - to.merge$time) <= 1; in the case of multiple to.merge that satisfy, we pick the one that minimises this distances.

How can I do this?

So my desired result is (this is just df with the corresponding value column of to.merge added for matching rows):

   x y time val
1  1 1    8  NA
2  2 1   11   c
3  3 1   17  NA
4  1 2   27  NA
5  2 2    6  NA
6  3 2   27  NA
7  1 3   28  NA
8  2 3   20  NA
9  3 3   19  NA
10 1 4    2  NA
11 2 4    6  NA
12 3 4    5  NA
13 1 5   21  NA
14 2 5   12  NA
15 3 5   23   d

where to.merge was:

  x y time val
1 2 1 17.0   a
2 2 1 12.0   b
3 2 1 11.6   c
4 3 5 22.5   d
5 2 4  2.0   e

Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (X, Y) = (2, 1).

Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.

Finally, there may be rows in to.merge that do not match anything in df.


One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows)

df$value <- NA
for (i in 1:nrow(df)) {
    row <- df[i, ]
    idx <- which(row$x == to.merge$x &
                 row$y == to.merge$y &
                 abs(row$time - to.merge$time) <= 1)
    if (length(idx)) {
        j <- idx[which.min(row$time - to.merge$time[idx])]
        df$val[i] <- to.merge$val[j]
    }
}

I feel that I can somehow do a merge, like:

to.merge$closest_time_in_df <- sapply(to.merge$time,
                                  function (tm) {
                                     dts <- abs(tm - df$time)
                                     # difference must be at most 1
                                     if (min(dts) <= 1) {
                                         df$time[which.min(dts)]
                                     } else {
                                         NA
                                     }
                                  })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)

But this doesn't merge the (2, 1, 11) row because to.merge$closest_time_in_df for (2, 1, 11.5, c) is 12, but a time of 12 in df corresponds to (x, y) = (2, 5) not (2, 1) hence the merge fails.

like image 694
mathematical.coffee Avatar asked Apr 19 '13 01:04

mathematical.coffee


People also ask

How do I merge two Dataframes with the same column names in R?

To combine two data frames with same columns in R language, call rbind() function, and pass the two data frames, as arguments. rbind() function returns the resulting data frame created from concatenating the given two data frames. For rbind() function to combine the given data frames, the column names must match.

How do I merge Dataframes based on rows?

The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.


2 Answers

Use data.table and roll='nearest' or to limit to 1, roll = 1, rollends = c(TRUE,TRUE)

eg

library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))

# use join syntax with roll = 'nearest'


tm[DT, roll='nearest']

#     x y time val
#  1: 1 1    8  NA
#  2: 1 2   27  NA
#  3: 1 3   28  NA
#  4: 1 4    2  NA
#  5: 1 5   21  NA
#  6: 2 1   11   c
#  7: 2 2    6  NA
#  8: 2 3   20  NA
#  9: 2 4    6   e
# 10: 2 5   12  NA
# 11: 3 1   17  NA
# 12: 3 2   27  NA
# 13: 3 3   19  NA
# 14: 3 4    5  NA
# 15: 3 5   23   d

You can limit your self to looking forward and back (1) by setting roll=-1 and rollends = c(TRUE,TRUE)

new <- tm[DT, roll=-1, rollends  =c(TRUE,TRUE)]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

Or you can roll=1 first, then roll=-1, then combine the results (tidying up the val.1 column from the second rolling join)

new <- tm[DT, roll = 1][tm[DT,roll=-1]][is.na(val), val := ifelse(is.na(val.1),val,val.1)][,val.1 := NULL]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d
like image 102
mnel Avatar answered Oct 21 '22 00:10

mnel


Using merge couple of times and aggregate once, here is how to do it.

set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)

#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
##   x y time.x val time.y dif
## 1 2 1   17.0   a     11 6.0
## 2 2 1   12.0   b     11 1.0
## 3 2 1   11.6   c     11 0.6
## 4 2 4    2.0   e      6 4.0
## 5 3 5   22.5   d     23 0.5

#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
##   x y dif time.x val time.y
## 1 2 1 0.6   11.6   c     11
## 2 2 4 4.0    2.0   e      6
## 3 3 5 0.5   22.5   d     23

#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
##    x y time  val
## 1  1 1    8 <NA>
## 2  1 2   27 <NA>
## 3  1 3   28 <NA>
## 4  1 4    2 <NA>
## 5  1 5   21 <NA>
## 6  2 1   11    c
## 7  2 2    6 <NA>
## 8  2 3   20 <NA>
## 9  2 4    6 <NA>
## 10 2 5   12 <NA>
## 11 3 1   17 <NA>
## 12 3 2   27 <NA>
## 13 3 3   19 <NA>
## 14 3 4    5 <NA>
## 15 3 5   23    d
like image 6
CHP Avatar answered Oct 21 '22 00:10

CHP