Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join R data.tables where key values are not exactly equal--combine rows with closest times

Tags:

time

r

data.table

Is there a slick way to join data tables in R where key values of time are close, but not exactly the same? For example, suppose I have a data table of results that are given for different time periods:

DT1 = data.table(x=rep(c("a","b","c"),each=3), time=c(10,30,60), v=1:9)

Here, we have some values (v) for different categories (x) taken at different times (time). Now suppose that I have data from another source that provides some time values for the different categories:

DT2=data.table(x=rep(c("a","b","c"),each=1),time=c(10,10,60))

I might be interested in trying to match the times in DT2 as closely as I can to DT1 to predict a value, v, for my DT2 categories. I would like to do something like

setkeyv(DT2,c("x","time"))
merge(DT1,DT2,by=c("time","v")

Which returns:

   time x v
1:   10 a 1
2:   10 b 4
3:   60 c 9

But what if my times didn't have the same precision? For example:

DT2=data.table(x=rep(c("a","b","c"),each=1),time=c(17,54,3))

Is there a way to perform a similar merge but pick times of DT2 that are close to those of DT1? That is 17 would be close to 30, 54 close to 60, and 3 close to 10?

If this simple example isn't clear, I'll briefly explain the larger problem that I am having. I have a data table with columns: category, time, output1, output2... There are hundreds of categories with associated times. I might want to pull output 1 for all categories at a specific time. Since the times were sampled with no apparent logic, sometimes the time is rounded to the nearest even second; in other cases the times are rounded to the nearest minute or even 10 minutes.

I could write a script to rewrite times in more common format, but I am curious if there is a slick data.table solution that I haven't seen. I have explored the rolling merge with no success.

like image 715
Docuemada Avatar asked Mar 29 '13 23:03

Docuemada


2 Answers

Another option may be roll='nearest' (new in v1.8.8 on CRAN).

> setkey(DT1,x,time)
> DT1
   x time v
1: a   10 1
2: a   30 2
3: a   60 3
4: b   10 4
5: b   30 5
6: b   60 6
7: c   10 7
8: c   30 8
9: c   60 9
> DT2
   x time
1: a   17
2: b   54
3: c    3
> DT1[DT2,roll="nearest"]
   x time v
1: a   17 1
2: b   54 6
3: c    3 7

Note that 17 appears to be closer to 10 than 30, hence the result in the first row.

If you need to roll to the next observation (next observation carried backwards) :

> DT1[DT2,roll=-Inf]
   x time v
1: a   17 2
2: b   54 6
3: c    3 7
like image 193
Matt Dowle Avatar answered Sep 28 '22 01:09

Matt Dowle


You can use findInterval to accomplish this:

setkey(DT2, time)
DT1[, id := findInterval(DT1$time, DT2$time)]
DT2[, id := 1:3]

setkey(DT1, "x", "id")
setkey(DT2, "x", "id")
print(DT1[DT2][, id := NULL])
#    x time v time.1
# 1: a   30 2     17
# 2: b   60 6     54
# 3: c   10 7      3

The idea: First sort the data.table by time because the second argument of findInterval requires increasing order of values. Now, use findInterval to find in which interval of 3, 17, 54 does the values in DT1$time fall and store it in id. In this particular case, it happens to range from 1 to 3. So, set these values as id column for DT2. Once you find the intervals and get id, then it's straightforward. Instead of setting x and time, set x and id as keys and do your merge.

Note: Suppose your DT1$time had a value of 0, then, the interval for that would have been 0. So, you'd get 4 unique values (0:3). In that case, it may be better to have DT2 with a time = 0 value as well. I just wanted to note this point here. I'll leave it to you.

like image 34
Arun Avatar answered Sep 28 '22 02:09

Arun