R - Speeding up approximate date match. idata.frame?

I am struggling to efficiently perform a "close" date match between two data frames. This question explores a solution using idata.frame from the plyr package, but I would be very happy with other suggested solutions as well.

Here is a simplified version of the two data frames:

sampleticker<-data.frame(cbind(ticker=c("A","A","AA","AA"),
  date=c("2005-1-25","2005-03-30","2005-02-15","2005-04-21")))
sampleticker$date<-as.Date(sampleticker$date,format="%Y-%m-%d")

samplereport<-data.frame(cbind(ticker=c("A","A","A","AA","AA","AA"),
  rdate=c("2005-2-15","2005-03-15","2005-04-15",
  "2005-03-01","2005-04-20","2005-05-01")))
samplereport$rdate<-as.Date(samplereport$rdate,format="%Y-%m-%d")

In the actual data, sampleticker is over 30,000 rows with 40 columns, and samplereport almost 300,000 rows with 25 columns.

What I would like to do is to merge the two data frames so that each row in sampleticker is combined with the closest date match in samplereport which occurs AFTER the date in sampleticker. I have solved similar problems in the past by doing a simple merge on the ticker field, sorting ascending, and then selecting unique combinations of ticker and date. However, with a dataset of this size, the merged result grows too large to be practical.
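For reference, the merge-then-filter approach described above can be sketched in base R roughly as follows. This is only an illustration on the sample data, not the original code; `candidates` and `closest` are names chosen here. The merge on ticker builds every ticker/report pairing before anything is filtered, which is why it scales poorly.

```r
# Sample data from the question
sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")),
  stringsAsFactors = FALSE
)
samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01")),
  stringsAsFactors = FALSE
)

# Every sampleticker row is paired with every report for the same ticker
candidates <- merge(sampleticker, samplereport, by = "ticker")

# Keep only reports dated after the ticker date, then sort ascending
candidates <- candidates[candidates$rdate > candidates$date, ]
candidates <- candidates[order(candidates$ticker, candidates$date,
                               candidates$rdate), ]

# The first row per (ticker, date) is the closest later report
closest <- candidates[!duplicated(candidates[c("ticker", "date")]), ]
```

On the sample data the intermediate merge has 12 rows for 4 input rows; the same ratio on the real data is what makes the approach impractical.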

As near as I can tell, merge does not allow this sort of approximate matching. I have seen some solutions which use findInterval, but since the distance between the dates will vary, I am not sure that I can specify an interval that will work for all rows.
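On the findInterval concern: findInterval takes a sorted vector of breakpoints rather than a fixed interval width, so uneven gaps between dates are not a problem. A minimal base R sketch, assuming the lookup is done one ticker at a time; `next_rdate` is a hypothetical helper written for this illustration, not part of any package:

```r
# Sample data from the question
sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")),
  stringsAsFactors = FALSE
)
samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first report date strictly after each ticker date
next_rdate <- function(tickers, dates, report) {
  out <- rep(as.Date(NA), length(dates))
  for (tk in unique(tickers)) {
    rows <- which(tickers == tk)
    breaks <- sort(report$rdate[report$ticker == tk])
    # findInterval() counts how many breaks are <= each date,
    # so that count + 1 indexes the first report date after it
    pos <- findInterval(as.numeric(dates[rows]), as.numeric(breaks)) + 1
    out[rows] <- breaks[pos]  # NA when no later report exists
  }
  out
}

sampleticker$rdate <- next_rdate(sampleticker$ticker, sampleticker$date,
                                 samplereport)
```

This loops over tickers rather than rows, so the number of subsetting operations depends on the number of distinct tickers, not on the 30,000+ rows.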

Following another post here, I have written the following code to use adply on each row and to perform the join:

library(plyr)
merge<-adply(sampleticker,1,function(x){
  # reports for the same ticker dated after this row's date
  y<-subset(samplereport,ticker %in% x$ticker & rdate > x$date)
  # keep the earliest of those reports
  y[which.min(y$rdate),]
  })

This works quite nicely: for the sample data, I get the output below, which is what I want.

   date       ticker      rdate
 1 2005-01-25  A          2005-02-15
 2 2005-03-30  A          2005-04-15
 3 2005-02-15  AA         2005-03-01
 4 2005-04-21  AA         2005-05-01

However, since the code performs 30,000+ subsetting operations, it is extremely slow: I ran the above query for more than a day before finally killing it.

I see here that plyr 1.0 has a structure, idata.frame, which refers to the data frame by reference instead of copying it, dramatically speeding up the subsetting operation. However, I cannot get the following code to work:

isamplereport<-idata.frame(samplereport)
adply(sampleticker,1,function(x){
  y<-subset(isamplereport,isamplereport$ticker %in% x$ticker & 
    isamplereport$rdate > x$date)
  y[which.min(y$rdate),]
})

I get the error

Error in list_to_dataframe(res, attr(.data, "split_labels")) : 
Results must be all atomic, or all data frames

This makes sense to me, since the operation returns an idata.frame (I assume). However, changing the last line to:

as.data.frame(y[which.min(y$rdate),]) 

also throws an error:

Error in `[.data.frame`(x$`_data`, x$`_rows`, x$`_cols`) : 
undefined columns selected.

Note that calling as.data.frame on the plain old samplereport returns the original data frame, as expected.

I know that idata.frame is experimental, so I didn't necessarily expect it to work properly. However, if anyone has an idea on how to fix this, I would appreciate it. Alternatively, if anyone could suggest a completely different approach that runs more efficiently, that would be fantastic.

Matt

UPDATE: data.table is the right way to go about this. See the answers below.

Matt asked Feb 13 '12 22:02

2 Answers

Thanks to Matthew Dowle and his addition of the ability to roll backwards as well as forwards in data.table, it is now much simpler to perform this merge.

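library(data.table)

ST <- data.table(sampleticker)
SR <- data.table(samplereport)
setkey(ST,ticker,date)
# Copy rdate into a join column so rdate itself is kept in the result
SR[,mergerdate:=rdate]
setkey(SR,ticker,mergerdate)
# roll=-Inf rolls the next report date backwards onto each ticker date
merge<-SR[ST,roll=-Inf]
setnames(merge,"mergerdate","date")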

#    ticker       date      rdate
# 1:      A 2005-01-25 2005-02-15
# 2:      A 2005-03-30 2005-04-15
# 3:     AA 2005-02-15 2005-03-01
# 4:     AA 2005-04-21 2005-05-01
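One detail worth checking: a rolling join counts an exact key match as a match, so a report dated the same day as the ticker date is returned, whereas the question asks for reports strictly after. A minimal sketch of the difference, using toy data and a `lookup` column that are assumptions made here for illustration (shifting by one day assumes the dates have daily granularity):

```r
library(data.table)

# Toy data: the second ticker date coincides with a report date
ST <- data.table(ticker = c("A", "A"),
                 date = as.Date(c("2005-01-25", "2005-02-15")))
SR <- data.table(ticker = c("A", "A"),
                 rdate = as.Date(c("2005-02-15", "2005-03-15")))

# Copy rdate so the report date survives the join
SR[, mergerdate := rdate]

# Plain rolling join: 2005-02-15 matches the same-day report
same_day <- SR[ST, on = c("ticker", mergerdate = "date"), roll = -Inf]

# Assumed workaround for "strictly after": look up date + 1 instead
ST[, lookup := date + 1]
strict <- SR[ST, on = c("ticker", mergerdate = "lookup"), roll = -Inf]
```

Here `same_day$rdate` gives 2005-02-15 for both rows, while `strict$rdate` gives 2005-02-15 and 2005-03-15.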
Matt answered Oct 25 '22 19:10


Here is a data.table-based solution that's likely to work better than what you are currently using:

library(data.table)
ST <- data.table(sampleticker, key="ticker")
SR <- data.table(samplereport, key="ticker")
SR <- SR[with(SR, order(ticker, rdate)),] # rdates need to be in increasing order

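# For each row of ST, take the first rdate (in sorted order) that is after date.
# Note: data.table 1.9.4 and later need by=.EACHI here to evaluate j once per row of ST.
SR[ST, list(date = date,
            rdate = rdate[match(TRUE, (rdate > date))]), ]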
#      ticker       date      rdate
# [1,]      A 2005-01-25 2005-02-15
# [2,]      A 2005-03-30 2005-04-15
# [3,]     AA 2005-02-15 2005-03-01
# [4,]     AA 2005-04-21 2005-05-01
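The key step in the expression above is `match(TRUE, rdate > date)`, which returns the position of the first TRUE in the comparison. Because the report dates are sorted in increasing order, that position is the earliest report after the given date. A small standalone illustration in base R (the variable names here are made up for the example):

```r
# Sorted report dates for one ticker
rdate <- as.Date(c("2005-02-15", "2005-03-15", "2005-04-15"))
date <- as.Date("2005-03-30")

# Position of the first report dated after `date`
first_after <- match(TRUE, rdate > date)
next_report <- rdate[first_after]

# When no report is later, match() returns NA and so does the lookup
none_after <- match(TRUE, rdate > as.Date("2005-12-31"))
```

For this data `first_after` is 3 and `next_report` is 2005-04-15, while `none_after` is NA, so rows without a later report end up with a missing rdate instead of an error.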

Of course, it sounds like what you really want to do is to merge together two much wider data.frames. To demonstrate one way of accomplishing that, in the example below, I add some columns to both data.tables, and then show how you could merge the appropriate rows:

# Add some columns to both data.tables
ST$alpha <- letters[seq_len(nrow(ST))]
SR$n     <- seq_len(nrow(SR))
SR$ALPHA <- LETTERS[seq_len(nrow(SR))]

# Perform a merge that includes the whole rows from samplereport
# corresponding to the selected rdate
RES <- SR[ST, cbind(date, .SD[match(TRUE,(rdate>date)),-1]), ]

# Merge RES (containing the selected rows from samplereport) back together
# with sampleticker
keycols <- c("ticker", "date")
setkeyv(RES, keycols)
setkeyv(ST, keycols)
ST[RES]
#      ticker       date alpha      rdate n ALPHA
# [1,]      A 2005-01-25     a 2005-02-15 1     A
# [2,]      A 2005-03-30     b 2005-04-15 3     C
# [3,]     AA 2005-02-15     c 2005-03-01 4     D
# [4,]     AA 2005-04-21     d 2005-05-01 6     F
Josh O'Brien answered Oct 25 '22 20:10