Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating variable in R data frame depending on another data frame

Tags:

r

I am seeking help after having wasted almost a day. I have a big data frame (bdf) and a small data frame (sdf). I want to add variable z to bdf depending on the value of sdf$y (which changes as a function of a time variable).

Here is a reproducible example:

bdf <- data.frame(tb = seq(as.POSIXct("2013-05-19 17:11:22 GMT", tz="GMT"), by=5624*24, length.out=10))

bdf
                tb
1  2013-05-19 17:11:22
2  2013-05-21 06:40:58
3  2013-05-22 20:10:34
4  2013-05-24 09:40:10
5  2013-05-25 23:09:46
6  2013-05-27 12:39:22
7  2013-05-29 02:08:58
8  2013-05-30 15:38:34
9  2013-06-01 05:08:10
10 2013-06-02 18:37:46


sdf <- data.frame(ts = as.POSIXct(c("2013-05-22", "2013-05-25", "2013-05-30"), tz="GMT"), y = c(0.2, -0.1, 0.3))

> sdf
      ts    y
1 2013-05-22  0.2
2 2013-05-25 -0.1
3 2013-05-30  0.3

I want to create variable z in bdf with the following values of sdf$y:

  • 0.2 for rows where bdf$tb ranges from the first bdf$tb value to mid-way between the 1st and 2nd value of sdf$ts. In this simple example, that is the case of rows 1 to 3 of dbf which have times bdf$tb below "2013-05-23 12:00:00 GMT".

  • -0.1 for rows where bdf$tb ranges from mid-way between the 1st and 2nd value of sdf$ts to mid-way between the 2nd and 3rd value of sdf$ts. In this simple example, that is the case of rows 4 and 5 of dbf which have times bdf$tb between "2013-05-23 12:00:00 GMT" and "2013-05-27 12:00:00 GMT".

  • 0.3 for all rows where bdf$tb ranges from mid-way between the 2nd and 3rd value of sdf$ts to the last value of bdf$tb. In this simple example, that is the case of rows 1 to 6 to 10 of dbf which have times larger than "2013-05-23 12:00:00 GMT".

Hence, in the end, the big dataframe bdf should look like this:

                 tb    z
1  2013-05-19 17:11:22  0.2
2  2013-05-21 06:40:58  0.2
3  2013-05-22 20:10:34  0.2
4  2013-05-24 09:40:10 -0.1
5  2013-05-25 23:09:46 -0.1
6  2013-05-27 12:39:22  0.3
7  2013-05-29 02:08:58  0.3
8  2013-05-30 15:38:34  0.3
9  2013-06-01 05:08:10  0.3
10 2013-06-02 18:37:46  0.3

I could not succeed using dplyr::mutate and got nowhere using loops... Any help would be much appreciated. I hope that I clearly described the issue as adhered to the etiquette (it is my first question).

like image 919
gattuso Avatar asked Jan 03 '15 16:01

gattuso


2 Answers

Here's a solution using data.table's rolling joins:

require(data.table)
setkey(setDT(sdf), ts)
sdf[bdf, roll = "nearest"]
#                      ts    y
#  1: 2013-05-19 17:11:22  0.2
#  2: 2013-05-21 06:40:58  0.2
#  3: 2013-05-22 20:10:34  0.2
#  4: 2013-05-24 09:40:10 -0.1
#  5: 2013-05-25 23:09:46 -0.1
#  6: 2013-05-27 12:39:22  0.3
#  7: 2013-05-29 02:08:58  0.3
#  8: 2013-05-30 15:38:34  0.3
#  9: 2013-06-01 05:08:10  0.3
# 10: 2013-06-02 18:37:46  0.3
  • setDT converts data.frame to data.table by reference.

  • setkey sorts the data.table by reference in increasing order by the columns provided, and marks those columns as key columns (so that we can join on those key columns later.

  • In data.table, x[i] performs a join when i is a data.table. I'll refer you to this answer to catch up on data.table joins, if you're not already familiar with.

  • x[i] performs an equi-join. That is, it finds matching row indices in x for every row in i and then extracts those rows from x to return the join result along with the corresponding row from i. In case a row in i doesn't find matching row indices in x, that row would have NA for x by default.

    However, x[i, roll = .] performs a rolling join. When there's no match, either the last observation is carried forward (roll = TRUE or -Inf), or the next observation can be carried backward (roll = Inf), or rolled to the nearest value (roll = "nearest"). And in this case you require roll = "nearest" IIUC.

HTH

like image 73
Arun Avatar answered Sep 23 '22 02:09

Arun


This seems now absolutely unnecessary, but in base R

bdf$z <- numeric(nrow(bdf))
for(i in seq_along(bdf$z)){
  ind <- which.min(abs(bdf$tb[i] - sdf$ts))
  bdf$z[i] <- sdf$y[ind]
}

While being little clumsy, it has an advantage in clarity, which accomodates easy adaptation to dplyr

library(dplyr)
bdf %>% rowwise() %>% 
  mutate(z= sdf$y[which.min(abs(as.numeric(tb)-as.numeric(sdf$ts)))])

#Source: local data frame [10 x 2]
#Groups: <by row>

#                    tb    z
#1  2013-05-19 17:11:22  0.2
#2  2013-05-21 06:40:58  0.2
#3  2013-05-22 20:10:34  0.2
#4  2013-05-24 09:40:10 -0.1
#5  2013-05-25 23:09:46 -0.1
#6  2013-05-27 12:39:22  0.3
#7  2013-05-29 02:08:58  0.3
#8  2013-05-30 15:38:34  0.3
#9  2013-06-01 05:08:10  0.3
#10 2013-06-02 18:37:46  0.3
like image 42
Khashaa Avatar answered Sep 20 '22 02:09

Khashaa