Merge dataframes on matching A, B and *closest* C?

I have two dataframes like so:

set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=F)

I want to merge to.merge into df (with all.x=T) such that:

df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1; if multiple rows of to.merge satisfy this, we pick the one that minimises this distance.

How can I do this?

So my desired result is (this is just df with the corresponding val column of to.merge added for matching rows):

   x y time val
1  1 1    8  NA
2  2 1   11   c
3  3 1   17  NA
4  1 2   27  NA
5  2 2    6  NA
6  3 2   27  NA
7  1 3   28  NA
8  2 3   20  NA
9  3 3   19  NA
10 1 4    2  NA
11 2 4    6  NA
12 3 4    5  NA
13 1 5   21  NA
14 2 5   12  NA
15 3 5   23   d

where to.merge was:

  x y time val
1 2 1 17.0   a
2 2 1 12.0   b
3 2 1 11.6   c
4 3 5 22.5   d
5 2 4  2.0   e

Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (X, Y) = (2, 1).

Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.

Finally, there may be rows in to.merge that do not match anything in df.

One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):

df$val <- NA
for (i in seq_len(nrow(df))) {
    row <- df[i, ]
    # candidates: same x and y, time within 1 of this row's time
    idx <- which(row$x == to.merge$x &
                 row$y == to.merge$y &
                 abs(row$time - to.merge$time) <= 1)
    if (length(idx)) {
        # among the candidates, take the one with the smallest absolute time difference
        j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
        df$val[i] <- to.merge$val[j]
    }
}

I feel that I can somehow do a merge, like:

to.merge$closest_time_in_df <- sapply(to.merge$time,
                                  function (tm) {
                                     dts <- abs(tm - df$time)
                                     # difference must be at most 1
                                     if (min(dts) <= 1) {
                                         df$time[which.min(dts)]
                                     } else {
                                         NA
                                     }
                                  })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)

But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1). Hence the merge fails.

Use data.table with a rolling join: roll='nearest' picks the closest time, and to limit the distance to 1, use roll = -1 with rollends = c(TRUE,TRUE).

For example:

library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))

# use join syntax with roll = 'nearest'


tm[DT, roll='nearest']

#     x y time val
#  1: 1 1    8  NA
#  2: 1 2   27  NA
#  3: 1 3   28  NA
#  4: 1 4    2  NA
#  5: 1 5   21  NA
#  6: 2 1   11   c
#  7: 2 2    6  NA
#  8: 2 3   20  NA
#  9: 2 4    6   e
# 10: 2 5   12  NA
# 11: 3 1   17  NA
# 12: 3 2   27  NA
# 13: 3 3   19  NA
# 14: 3 4    5  NA
# 15: 3 5   23   d
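Note that roll='nearest' on its own does not enforce the "within 1" condition: row 9 above picks up e even though its time (2) is 4 away from 6. One possible way to add the limit, as a minimal sketch rather than the method used in the rest of this answer, is to carry a copy of to.merge's time through the join and blank out matches that are too far away. The helper column name tm_time is my own assumption, not something from the question.

```r
library(data.table)

set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2),
                       val = letters[1:5], stringsAsFactors = FALSE)

DT <- as.data.table(df)
tm <- as.data.table(to.merge)

# tm_time is a hypothetical helper column: a copy of to.merge's own time,
# kept because the join column 'time' in the result takes DT's value
tm[, tm_time := time]

# nearest match within each (x, y) group, with no distance limit yet
res <- tm[DT, on = c("x", "y", "time"), roll = "nearest"]

# discard matches that are more than 1 away, then drop the helper column
res[abs(time - tm_time) > 1, val := NA_character_]
res[, tm_time := NULL]
res
```

Because this joins with on= instead of keys, the rows should come back in the original order of df rather than sorted by (x, y, time).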

You can limit yourself to looking forward and back by at most 1 by setting roll=-1 and rollends = c(TRUE,TRUE). Row 9 is now NA, because e is more than 1 away:

new <- tm[DT, roll = -1, rollends = c(TRUE, TRUE)]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

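Or you can do roll=1 first, then roll=-1, and combine the results, tidying up the val.1 column that comes from the second rolling join. (Newer data.table versions are expected to name this duplicated column i.val instead of val.1; check names() on the intermediate result and adjust accordingly.)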

new <- tm[DT, roll = 1][tm[DT,roll=-1]][is.na(val), val := ifelse(is.na(val.1),val,val.1)][,val.1 := NULL]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

Here is how to do it using merge a couple of times and aggregate once.

set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)

#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
##   x y time.x val time.y dif
## 1 2 1   17.0   a     11 6.0
## 2 2 1   12.0   b     11 1.0
## 3 2 1   11.6   c     11 0.6
## 4 2 4    2.0   e      6 4.0
## 5 3 5   22.5   d     23 0.5

#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
##   x y dif time.x val time.y
## 1 2 1 0.6   11.6   c     11
## 2 2 4 4.0    2.0   e      6
## 3 3 5 0.5   22.5   d     23

#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
##    x y time  val
## 1  1 1    8 <NA>
## 2  1 2   27 <NA>
## 3  1 3   28 <NA>
## 4  1 4    2 <NA>
## 5  1 5   21 <NA>
## 6  2 1   11    c
## 7  2 2    6 <NA>
## 8  2 3   20 <NA>
## 9  2 4    6 <NA>
## 10 2 5   12 <NA>
## 11 3 1   17 <NA>
## 12 3 2   27 <NA>
## 13 3 3   19 <NA>
## 14 3 4    5 <NA>
## 15 3 5   23    d
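The steps above rely on df having a single row per (x, y), since the minimum is taken per (x, y) pair and the final merge is on x and y only. If df could contain several times for the same (x, y), a variation along the same lines, shown here only as a sketch under that assumption, is to tag each df row with an id and pick the closest candidate per id. The id column is a hypothetical helper I introduce for illustration.

```r
set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2),
                       val = letters[1:5], stringsAsFactors = FALSE)

# hypothetical helper column identifying each row of df
df$id <- seq_len(nrow(df))

# all (df row, to.merge row) pairs that agree on x and y;
# df's time keeps its name, to.merge's time becomes time.tm
cand <- merge(df, to.merge, by = c("x", "y"), suffixes = c("", ".tm"))
cand$dif <- abs(cand$time - cand$time.tm)

# keep only candidates within 1, then the closest one per df row
cand <- cand[cand$dif <= 1, ]
cand <- cand[order(cand$id, cand$dif), ]
best <- cand[!duplicated(cand$id), c("id", "val")]

# attach val to every row of df; unmatched rows get NA
final <- merge(df, best, by = "id", all.x = TRUE)
final
```

Merging on id also returns the rows in df's original order, which matches the layout of the desired result in the question.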

Tags:

r

mathematical.coffee

2 Answers

mnel

CHP
