Let's take the following data:
dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09","2010-01-10","2010-01-11","2010-01-13","2010-04-01","2010-04-02","2010-08-03","2011-02-05")),
ID=c(1,2,1,3,1,2,1,1,2,2,1),VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt,TICKER,PERIOD,ID,DATE)
Now for each ticker/period combination, I need the following in new columns:

PRIORAVG: the mean of the latest VALUE of each ID, excluding the current ID, provided it is no more than 180 days old.
PREV: the previous VALUE from the same ID.

The result should look like this:
TICKER PERIOD DATE ID VALUE PRIORAVG PREV
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30 1.5
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40 1.4
[5,] ABC 2010-12-31 2010-04-01 1 1.7 1.40 1.5
[6,] ABC 2010-12-31 2010-01-07 2 1.3 1.50 NA
[7,] ABC 2010-12-31 2010-01-11 2 1.2 1.50 1.3
[8,] ABC 2010-12-31 2010-04-02 2 1.8 1.65 1.2
[9,] ABC 2010-12-31 2010-08-03 2 1.7 1.70 1.8
[10,] ABC 2010-12-31 2010-01-09 3 1.6 1.35 NA
[11,] DEF 2011-12-31 2011-02-05 1 2.3 NA NA
Note the PRIORAVG on row 9 equals 1.7 (the VALUE on row 5, which is the only prior observation within the past 180 days by another ID).
I have discovered the data.table package, but I can't seem to fully understand the := operator. When I keep it simple, it seems to work. To obtain the previous value for each ID (I based this on the solution to this question):
dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),roll=TRUE,mult="last"][,VALUE]]
This works great, and it only takes 0.13 seconds to perform this operation over my dataset with ~250k rows; my vector scan function gets identical results but is about 30,000 times slower.
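For readers unfamiliar with rolling joins, here is a minimal, self-contained sketch (toy data, not the dataset above) of how roll=TRUE with mult="last" picks up the most recent prior observation:

```r
library(data.table)

d <- data.table(ID    = c(1L, 1L, 1L),
                DATE  = as.Date(c("2010-01-05", "2010-01-08", "2010-01-10")),
                VALUE = c(1.5, 1.4, 1.4))
setkey(d, ID, DATE)

# Look up, for each row, the last VALUE strictly before its own DATE:
# joining on DATE - 1 with roll = TRUE rolls the most recent earlier
# observation forward to the lookup date.
prev <- d[J(d$ID, d$DATE - 1), VALUE, roll = TRUE, mult = "last"]
prev  # NA 1.5 1.4
```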
Ok, so I've got my first requirement. Let's get to the second, more complex requirement. Right now the fastest method I have found is a couple of vector scans, applying the function to each row through the plyr function adply.
calc <- function(df,ticker,period,id,date) {
df <- df[df$TICKER == ticker & df$PERIOD == period
& df$ID != id & df$DATE < date & df$DATE > date-180, ]
df <- df[order(df$DATE),]
mean(df[!duplicated(df$ID, fromLast = TRUE),"VALUE"])
}
df <- data.frame(dt)
adply(df,1,function(x) calc(df,x$TICKER,x$PERIOD,x$ID,x$DATE))
I wrote the function for a data.frame and it does not seem to work with a data.table. For a subset of 5000 rows this takes about 44 seconds, but my data consists of over 1 million rows. I wonder if this can be made more efficient through the use of :=.
dt[J("ABC"),last(VALUE),by=ID][,mean(V1)]
This works to select the average of the latest VALUEs for each ID for ABC.
dt[,PRIORAVG:=dt[J(TICKER,PERIOD),last(VALUE),by=ID][,mean(V1)]]
This, however, does not work as expected: it takes the average of all last VALUEs across all ticker/periods instead of only those for the current ticker/period, so every row ends up with the same mean value. Am I doing something wrong, or is this a limitation of :=?
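For context on the behaviour above (a toy sketch, not the actual dataset): the right-hand side of := is evaluated once against the whole table unless a by clause is given, and a length-1 result is recycled into every row:

```r
library(data.table)

x <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))

# Without by, the RHS is computed once over the whole table
# and the single result is recycled into every row:
x[, m_all := mean(v)]          # all rows get 2

# With by, the RHS is computed once per group:
x[, m_grp := mean(v), by = g]  # "a" rows get 1.5, the "b" row gets 3
```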
Great question. Try this:
dt
TICKER PERIOD DATE ID VALUE
[1,] ABC 2010-12-31 2010-01-05 1 1.5
[2,] ABC 2010-12-31 2010-01-08 1 1.4
[3,] ABC 2010-12-31 2010-01-10 1 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5
[5,] ABC 2010-12-31 2010-01-07 2 1.3
[6,] ABC 2010-12-31 2010-01-11 2 1.2
[7,] ABC 2010-12-31 2010-01-09 3 1.6
[8,] DEF 2011-12-31 2011-02-05 1 2.3
ids = unique(dt$ID)
dt[,PRIORAVG:=NA_real_]
for (i in 1:nrow(dt))
dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]
dt
TICKER PERIOD DATE ID VALUE PRIORAVG
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40
[5,] ABC 2010-12-31 2010-01-07 2 1.3 1.50
[6,] ABC 2010-12-31 2010-01-11 2 1.2 1.50
[7,] ABC 2010-12-31 2010-01-09 3 1.6 1.35
[8,] DEF 2011-12-31 2011-02-05 1 2.3 NA
Then what you had already with a slight simplification ...
dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]
TICKER PERIOD DATE ID VALUE PRIORAVG PREV
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30 1.5
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40 1.4
[5,] ABC 2010-12-31 2010-01-07 2 1.3 1.50 NA
[6,] ABC 2010-12-31 2010-01-11 2 1.2 1.50 1.3
[7,] ABC 2010-12-31 2010-01-09 3 1.6 1.35 NA
[8,] DEF 2011-12-31 2011-02-05 1 2.3 NA NA
If this is ok as a prototype, then a large speed improvement would be to keep the loop but use set() instead of :=, to reduce overhead:
for (i in 1:nrow(dt))
set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])
dt
TICKER PERIOD DATE ID VALUE PRIORAVG PREV
[1,] ABC 2010-12-31 2010-01-05 1 1.5 NA NA
[2,] ABC 2010-12-31 2010-01-08 1 1.4 1.30 1.5
[3,] ABC 2010-12-31 2010-01-10 1 1.4 1.45 1.4
[4,] ABC 2010-12-31 2010-01-13 1 1.5 1.40 1.4
[5,] ABC 2010-12-31 2010-01-07 2 1.3 1.50 NA
[6,] ABC 2010-12-31 2010-01-11 2 1.2 1.50 1.3
[7,] ABC 2010-12-31 2010-01-09 3 1.6 1.35 NA
[8,] DEF 2011-12-31 2011-02-05 1 2.3 NA NA
That should be a lot faster than the repeated vector scans shown in the question.
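For reference, set() takes the table, a row number, a column (position or name) and a value, and assigns by reference without the dispatch overhead of [.data.table; a minimal sketch with made-up data:

```r
library(data.table)

z <- data.table(a = 1:3, b = c(10, 20, 30))

# set(DT, i, j, value): j = 2L addresses the second column ("b").
# Assigning in a loop this way avoids calling [.data.table per row.
for (i in 1:nrow(z))
  set(z, i, 2L, z$b[i] * 2)

z$b  # 20 40 60
```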
Or the operation could be vectorized, but that would be harder to write and read given the nature of this task.
Btw, there isn't any data in the question that would test the 180 day requirement. If you add some and show expected output again, then I'll add the calculation of age using the join inherited scope I mentioned in comments.
Another possible approach using later versions of data.table:
library(data.table) #data.table_1.12.6 as of Nov 20, 2019
cols <- copy(names(DT))
DT[, c("MIN_DATE", "MAX_DATE") := .(DATE - 180L, DATE)]
DT[, PRIORAVG :=
.SD[.SD, on=.(TICKER, PERIOD, DATE>=MIN_DATE, DATE<=MAX_DATE),
by=.EACHI, {
subdat <- .SD[x.ID!=i.ID]
pavg <- if (subdat[, .N > 0L])
mean(subdat[, last(VALUE), ID]$V1, na.rm=TRUE)
else
NA_real_
c(setNames(mget(paste0("i.", cols)), cols), .(PRIORAVG=pavg))
}]$PRIORAVG
]
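The core mechanism in the block above is a non-equi self-join combined with by=.EACHI; a stripped-down sketch with made-up tables (a holds per-row date-style ranges, b the observations):

```r
library(data.table)

a <- data.table(id = c(1L, 2L), lo = c(0, 5),  hi = c(10, 15))
b <- data.table(id = c(1L, 1L, 2L), x = c(3, 12, 7))

# For each row of a, count the rows of b with a matching id and
# x within [lo, hi]; by = .EACHI evaluates j once per row of i.
res <- b[a, on = .(id, x >= lo, x <= hi), .(n = .N), by = .EACHI]
res$n  # 1 1  (x = 12 falls outside [0, 10], so it is excluded)
```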
DT[, PREV := shift(VALUE), .(TICKER, PERIOD, ID)]
output:
TICKER PERIOD DATE ID VALUE MIN_DATE MAX_DATE PRIORAVG PREV
1: ABC 2010-12-31 2010-01-05 1 1.5 2009-07-09 2010-01-05 NA NA
2: ABC 2010-12-31 2010-01-08 1 1.4 2009-07-12 2010-01-08 1.30 1.5
3: ABC 2010-12-31 2010-01-10 1 1.4 2009-07-14 2010-01-10 1.45 1.4
4: ABC 2010-12-31 2010-01-13 1 1.5 2009-07-17 2010-01-13 1.40 1.4
5: ABC 2010-12-31 2010-04-01 1 1.7 2009-10-03 2010-04-01 1.40 1.5
6: ABC 2010-12-31 2010-01-07 2 1.3 2009-07-11 2010-01-07 1.50 NA
7: ABC 2010-12-31 2010-01-11 2 1.2 2009-07-15 2010-01-11 1.50 1.3
8: ABC 2010-12-31 2010-04-02 2 1.8 2009-10-04 2010-04-02 1.65 1.2
9: ABC 2010-12-31 2010-08-03 2 1.7 2010-02-04 2010-08-03 1.70 1.8
10: ABC 2010-12-31 2010-01-09 3 1.6 2009-07-13 2010-01-09 1.35 NA
11: DEF 2011-12-31 2011-02-05 1 2.3 2010-08-09 2011-02-05 NA NA
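The PREV column in this answer comes from shift(), which lags a vector by one position within each group; note this relies on rows already being ordered by date within each group, unlike the rolling-join version. A tiny sketch with made-up data:

```r
library(data.table)

y <- data.table(id = c(1L, 1L, 2L), value = c(1.5, 1.4, 1.3))

# shift() lags value by one position within each id group;
# the first row of each group gets NA.
y[, prev := shift(value), by = id]
y$prev  # NA 1.5 NA
```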