I am seeking help after having wasted almost a day. I have a big data frame (bdf) and a small data frame (sdf). I want to add variable z to bdf depending on the value of sdf$y (which changes as a function of a time variable).
Here is a reproducible example:
bdf <- data.frame(tb = seq(as.POSIXct("2013-05-19 17:11:22 GMT", tz="GMT"), by=5624*24, length.out=10))
bdf
tb
1 2013-05-19 17:11:22
2 2013-05-21 06:40:58
3 2013-05-22 20:10:34
4 2013-05-24 09:40:10
5 2013-05-25 23:09:46
6 2013-05-27 12:39:22
7 2013-05-29 02:08:58
8 2013-05-30 15:38:34
9 2013-06-01 05:08:10
10 2013-06-02 18:37:46
sdf <- data.frame(ts = as.POSIXct(c("2013-05-22", "2013-05-25", "2013-05-30"), tz="GMT"), y = c(0.2, -0.1, 0.3))
> sdf
ts y
1 2013-05-22 0.2
2 2013-05-25 -0.1
3 2013-05-30 0.3
I want to create variable z in bdf with the following values of sdf$y:
0.2 for rows where bdf$tb ranges from the first bdf$tb value to mid-way between the 1st and 2nd value of sdf$ts. In this simple example, that is the case of rows 1 to 3 of dbf which have times bdf$tb below "2013-05-23 12:00:00 GMT".
-0.1 for rows where bdf$tb ranges from mid-way between the 1st and 2nd value of sdf$ts to mid-way between the 2nd and 3rd value of sdf$ts. In this simple example, that is the case of rows 4 and 5 of dbf which have times bdf$tb between "2013-05-23 12:00:00 GMT" and "2013-05-27 12:00:00 GMT".
0.3 for all rows where bdf$tb ranges from mid-way between the 2nd and 3rd value of sdf$ts to the last value of bdf$tb. In this simple example, that is the case of rows 1 to 6 to 10 of dbf which have times larger than "2013-05-23 12:00:00 GMT".
Hence, in the end, the big dataframe bdf should look like this:
tb z
1 2013-05-19 17:11:22 0.2
2 2013-05-21 06:40:58 0.2
3 2013-05-22 20:10:34 0.2
4 2013-05-24 09:40:10 -0.1
5 2013-05-25 23:09:46 -0.1
6 2013-05-27 12:39:22 0.3
7 2013-05-29 02:08:58 0.3
8 2013-05-30 15:38:34 0.3
9 2013-06-01 05:08:10 0.3
10 2013-06-02 18:37:46 0.3
I could not succeed using dplyr::mutate and got nowhere using loops... Any help would be much appreciated. I hope that I clearly described the issue as adhered to the etiquette (it is my first question).
Here's a solution using data.table
's rolling joins:
require(data.table)
setkey(setDT(sdf), ts)
sdf[bdf, roll = "nearest"]
# ts y
# 1: 2013-05-19 17:11:22 0.2
# 2: 2013-05-21 06:40:58 0.2
# 3: 2013-05-22 20:10:34 0.2
# 4: 2013-05-24 09:40:10 -0.1
# 5: 2013-05-25 23:09:46 -0.1
# 6: 2013-05-27 12:39:22 0.3
# 7: 2013-05-29 02:08:58 0.3
# 8: 2013-05-30 15:38:34 0.3
# 9: 2013-06-01 05:08:10 0.3
# 10: 2013-06-02 18:37:46 0.3
setDT
converts data.frame to data.table by reference.
setkey
sorts the data.table by reference in increasing order by the columns provided, and marks those columns as key columns (so that we can join on those key columns later.
In data.table, x[i]
performs a join when i
is a data.table. I'll refer you to this answer to catch up on data.table joins, if you're not already familiar with.
x[i]
performs an equi-join. That is, it finds matching row indices in x
for every row in i
and then extracts those rows from x
to return the join result along with the corresponding row from i
. In case a row in i
doesn't find matching row indices in x
, that row would have NA
for x
by default.
However, x[i, roll = .]
performs a rolling join. When there's no match, either the last observation is carried forward (roll = TRUE
or -Inf
), or the next observation can be carried backward (roll = Inf
), or rolled to the nearest value (roll = "nearest"
). And in this case you require roll = "nearest"
IIUC.
HTH
This seems now absolutely unnecessary, but in base R
bdf$z <- numeric(nrow(bdf))
for(i in seq_along(bdf$z)){
ind <- which.min(abs(bdf$tb[i] - sdf$ts))
bdf$z[i] <- sdf$y[ind]
}
While being little clumsy, it has an advantage in clarity, which accomodates easy adaptation to dplyr
library(dplyr)
bdf %>% rowwise() %>%
mutate(z= sdf$y[which.min(abs(as.numeric(tb)-as.numeric(sdf$ts)))])
#Source: local data frame [10 x 2]
#Groups: <by row>
# tb z
#1 2013-05-19 17:11:22 0.2
#2 2013-05-21 06:40:58 0.2
#3 2013-05-22 20:10:34 0.2
#4 2013-05-24 09:40:10 -0.1
#5 2013-05-25 23:09:46 -0.1
#6 2013-05-27 12:39:22 0.3
#7 2013-05-29 02:08:58 0.3
#8 2013-05-30 15:38:34 0.3
#9 2013-06-01 05:08:10 0.3
#10 2013-06-02 18:37:46 0.3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With