
data.table time subset vs xts time subset

Hi, I am looking to subset some minute-by-minute data by time of day. I normally use xts, doing something like:

subset.string <- 'T10:00/T13:00' 

to get all the rows between 10am and 1pm (inclusive) EACH DAY, with the output in xts format. But this is a bit slow for my purposes, e.g.

j <- xts(rnorm(10e6),Sys.time()-(10e6:1))
   user  system elapsed 
  5.704   0.577  17.115 

I know that data.table is very fast at subsetting large datasets, so I am wondering whether, in conjunction with the fasttime package for fast POSIXct creation, it would be worth creating a function like

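dt.time.subset <- function(xts.min.obj, subset.string){
  x.dt <- data.table(ts=format(index(xts.min.obj),"%Y-%m-%d %H:%M:%S %Z"),
                     coredata(xts.min.obj))  # data columns (assumed)
  # pseudocode: some subsetting operation using %between%
  out <- x.dt[some.subsetting.operation.using.between]
}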

to convert the xts.min.obj into a data.table, add some sort of character index, use data.table to subset the relevant rows, and then use the output row index with fasttime to recreate an xts output? Or is this too many excess operations for something that is already highly optimised and written in C?

h.l.m asked Jun 27 '13 14:06


1 Answer

If you're ok with specifying your range in UTC, you can do:

j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)]
# +60 because xts includes that minute; you'll need to offset the times
# appropriately to match with xts unless you live in UTC :)

library(xts)
library(data.table)  # provides %between%

j <- xts(rnorm(10e6),Sys.time()-(10e6:1))
system.time(j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)])
#  user  system elapsed 
#  1.17    0.08    1.25 
# likely faster on your machine as mine takes minutes to run the OP bench
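
To make the offset mentioned in the comment above concrete, here is a minimal base R sketch of the same modulo idea. The helper in.time.window and its utc.offset.sec argument are hypothetical names for illustration, not part of xts or data.table, and the sketch assumes a fixed offset from UTC, so it ignores daylight saving changes.

```r
# Hypothetical helper (not from xts or data.table): TRUE for timestamps whose
# time of day lies in [from.sec, to.sec], given in seconds after local midnight.
# utc.offset.sec is an assumed fixed offset from UTC; DST changes are ignored.
in.time.window <- function(idx, from.sec, to.sec, utc.offset.sec = 0) {
  secs <- (as.numeric(idx) + utc.offset.sec) %% 86400  # seconds since local midnight
  secs >= from.sec & secs <= to.sec
}

# Three instants, written in UTC: 09:59:59, 10:00:00 and 16:00:30
idx <- as.POSIXct(c("2013-06-27 09:59:59", "2013-06-27 10:00:00",
                    "2013-06-27 16:00:30"), tz = "UTC")

# Window 10:00 to 16:00 plus the extra minute, read in UTC
res.utc <- in.time.window(idx, 10 * 3600, 16 * 3600 + 60)

# Same window read in a zone assumed to be one hour ahead of UTC, where the
# instants fall at 10:59:59, 11:00:00 and 17:00:30 local time
res.plus1 <- in.time.window(idx, 10 * 3600, 16 * 3600 + 60, utc.offset.sec = 3600)
```

With an xts object, the logical vector could be used as a row index in the same way as the %between% expression, e.g. j[in.time.window(index(j), 10 * 3600, 16 * 3600 + 60)].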
eddi answered Sep 24 '22 15:09
