Creating variable in R data frame depending on another data frame

Question

I am seeking help after having wasted almost a day. I have a big data frame (bdf) and a small data frame (sdf). I want to add variable z to bdf depending on the value of sdf$y (which changes as a function of a time variable).

Here is a reproducible example:

bdf <- data.frame(tb = seq(as.POSIXct("2013-05-19 17:11:22 GMT", tz="GMT"), by=5624*24, length.out=10))

bdf
                tb
1  2013-05-19 17:11:22
2  2013-05-21 06:40:58
3  2013-05-22 20:10:34
4  2013-05-24 09:40:10
5  2013-05-25 23:09:46
6  2013-05-27 12:39:22
7  2013-05-29 02:08:58
8  2013-05-30 15:38:34
9  2013-06-01 05:08:10
10 2013-06-02 18:37:46


sdf <- data.frame(ts = as.POSIXct(c("2013-05-22", "2013-05-25", "2013-05-30"), tz="GMT"), y = c(0.2, -0.1, 0.3))

> sdf
      ts    y
1 2013-05-22  0.2
2 2013-05-25 -0.1
3 2013-05-30  0.3

I want to create variable z in bdf with the following values of sdf$y:

0.2 for rows where bdf$tb ranges from the first bdf$tb value to mid-way between the 1st and 2nd value of sdf$ts. In this simple example, that is the case of rows 1 to 3 of dbf which have times bdf$tb below "2013-05-23 12:00:00 GMT".
-0.1 for rows where bdf$tb ranges from mid-way between the 1st and 2nd value of sdf$ts to mid-way between the 2nd and 3rd value of sdf$ts. In this simple example, that is the case of rows 4 and 5 of dbf which have times bdf$tb between "2013-05-23 12:00:00 GMT" and "2013-05-27 12:00:00 GMT".
0.3 for all rows where bdf$tb ranges from mid-way between the 2nd and 3rd value of sdf$ts to the last value of bdf$tb. In this simple example, that is the case of rows 1 to 6 to 10 of dbf which have times larger than "2013-05-23 12:00:00 GMT".

Hence, in the end, the big dataframe bdf should look like this:

                 tb    z
1  2013-05-19 17:11:22  0.2
2  2013-05-21 06:40:58  0.2
3  2013-05-22 20:10:34  0.2
4  2013-05-24 09:40:10 -0.1
5  2013-05-25 23:09:46 -0.1
6  2013-05-27 12:39:22  0.3
7  2013-05-29 02:08:58  0.3
8  2013-05-30 15:38:34  0.3
9  2013-06-01 05:08:10  0.3
10 2013-06-02 18:37:46  0.3

I could not succeed using dplyr::mutate and got nowhere using loops... Any help would be much appreciated. I hope that I clearly described the issue as adhered to the etiquette (it is my first question).

Arun · Accepted Answer

Here's a solution using data.table's rolling joins:

require(data.table)
setkey(setDT(sdf), ts)
sdf[bdf, roll = "nearest"]
#                      ts    y
#  1: 2013-05-19 17:11:22  0.2
#  2: 2013-05-21 06:40:58  0.2
#  3: 2013-05-22 20:10:34  0.2
#  4: 2013-05-24 09:40:10 -0.1
#  5: 2013-05-25 23:09:46 -0.1
#  6: 2013-05-27 12:39:22  0.3
#  7: 2013-05-29 02:08:58  0.3
#  8: 2013-05-30 15:38:34  0.3
#  9: 2013-06-01 05:08:10  0.3
# 10: 2013-06-02 18:37:46  0.3

setDT converts data.frame to data.table by reference.
setkey sorts the data.table by reference in increasing order by the columns provided, and marks those columns as key columns (so that we can join on those key columns later.
In data.table, x[i] performs a join when i is a data.table. I'll refer you to this answer to catch up on data.table joins, if you're not already familiar with.
x[i] performs an equi-join. That is, it finds matching row indices in x for every row in i and then extracts those rows from x to return the join result along with the corresponding row from i. In case a row in i doesn't find matching row indices in x, that row would have NA for x by default.

However, x[i, roll = .] performs a rolling join. When there's no match, either the last observation is carried forward (roll = TRUE or -Inf), or the next observation can be carried backward (roll = Inf), or rolled to the nearest value (roll = "nearest"). And in this case you require roll = "nearest" IIUC.

HTH

Khashaa · Answer

This seems now absolutely unnecessary, but in base R

bdf$z <- numeric(nrow(bdf))
for(i in seq_along(bdf$z)){
  ind <- which.min(abs(bdf$tb[i] - sdf$ts))
  bdf$z[i] <- sdf$y[ind]
}

While being little clumsy, it has an advantage in clarity, which accomodates easy adaptation to dplyr

library(dplyr)
bdf %>% rowwise() %>% 
  mutate(z= sdf$y[which.min(abs(as.numeric(tb)-as.numeric(sdf$ts)))])

#Source: local data frame [10 x 2]
#Groups: <by row>

#                    tb    z
#1  2013-05-19 17:11:22  0.2
#2  2013-05-21 06:40:58  0.2
#3  2013-05-22 20:10:34  0.2
#4  2013-05-24 09:40:10 -0.1
#5  2013-05-25 23:09:46 -0.1
#6  2013-05-27 12:39:22  0.3
#7  2013-05-29 02:08:58  0.3
#8  2013-05-30 15:38:34  0.3
#9  2013-06-01 05:08:10  0.3
#10 2013-06-02 18:37:46  0.3

Creating variable in R data frame depending on another data frame

Tags:

r

gattuso

2 Answers

Arun

Khashaa

Recent Activity

Donate For Us

Creating variable in R data frame depending on another data frame

Tags:

r

gattuso

2 Answers

Arun

Khashaa

Related questions

Recent Activity

Donate For Us