Convert seconds interval data to hourly mean representation in LESS TIME

Tags: r, time-series

I have a data file containing readings sampled at 30-second intervals. The file is organized as follows:

> head(dframe)
            timestamp    power
1 2015-08-01 00:00:04 584.1379
2 2015-08-01 00:00:34 585.8087
3 2015-08-01 00:01:04 584.9335
4 2015-08-01 00:01:34 584.4366
5 2015-08-01 00:02:04 584.2829
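
For readers without the data file, a minimal stand-in frame can be simulated (the values are made up and only replace the read.csv step in the update below):

# hypothetical stand-in for reading the original CSV (made-up values):
# character timestamps at 30-second steps, as read.csv would produce
dframe <- data.frame(
  timestamp = format(seq(as.POSIXct("2015-08-01 00:00:04"), by = "30 sec",
                         length.out = 10)),
  power = runif(10, 580, 590)
)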

To represent the 30-second data as hourly means, I use the following R command:

df = aggregate(list(power=dframe$power),by=list(timestamp=cut(as.POSIXct(dframe$timestamp),"hour")),mean) 

This works perfectly. The actual issue is the time it takes on large files (a year of data). Can I somehow speed up the conversion? In other words, is there a faster alternative in R for converting second-level data to hourly means?

UPDATE: I tried four different methods for this problem, as suggested by @akrun and @Joshua. For other Stack Overflow users, here is the usage of each method and the time it took.

library(data.table)
library(dplyr)
library(xts)

dframe <- read.csv(path, header = TRUE, sep = ",")
dframe$timestamp <- as.POSIXct(dframe$timestamp)
xframe <- dframe  # second name for the dplyr run

# using aggregate
system.time(
  df1 <- aggregate(list(power = dframe$power),
                   by = list(timestamp = cut(dframe$timestamp, "hour")), mean)
)
# using data.table
system.time(
  dfx <- setDT(dframe)[, list(power = mean(power)),
                       by = list(timestamp = cut(timestamp, "hour"))]
)
# using dplyr
system.time(
  df2 <- xframe %>%
    group_by(timestamp = cut(timestamp, "hour")) %>%
    summarise(power = mean(power))
)
# using xts
system.time({
  x <- xts(dframe$power, dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp = trunc(index(h), "hours"), power = coredata(h))
})

Respective times on two datasets (one month and three months) were as follows. For the one-month dataset:

Method       user  system elapsed 
Aggregate    0.137   0.005   0.142
data.table   0.031   0.001   0.032 
dplyr        0.035   0.001   0.036  
xts          0.053   0.000   0.053  

For the three-month dataset:

Method       user  system elapsed
Aggregate    0.456   0.019   0.475
data.table   0.099   0.002   0.102
dplyr        0.099   0.004   0.103
xts          0.158   0.004   0.161

Caveat: all approaches except xts change the type of the timestamp column from POSIXct to factor. That means you have to convert the timestamp column back again, which costs some extra CPU cycles. In short: if you need a POSIXct timestamp in the end, xts is the best choice; otherwise go for data.table.
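
For example, restoring a POSIXct column afterwards is a one-liner. A minimal sketch, assuming the df1 result from the aggregate call above (the time zone is an assumption, since the post does not state one):

# convert the factor labels produced by cut() back to POSIXct (tz assumed)
df1$timestamp <- as.POSIXct(as.character(df1$timestamp), tz = "UTC")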

DATASET: the dataset used can be found at link.

asked Mar 15 '23 by Haroon Rashid

1 Answer

You can do this aggregation in less than half the time using tools from the xts package.

library(xts)

# sample data
set.seed(21)
N <- 2e6
dframe <- data.frame(timestamp=seq(Sys.time(), by="30 sec", length.out=N),
                     power=rnorm(N))
# aggregate
system.time(a <- aggregate(list(power = dframe$power),
                           by = list(timestamp = cut(dframe$timestamp, "hour")),
                           mean))
#    user  system elapsed 
#   2.456   0.000   2.457 

# xts
system.time({
  x <- xts(dframe$power, dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
})
#    user  system elapsed 
#   0.888   0.004   0.893 
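
As a quick sanity check (not part of the original answer), both results should contain the same hourly means, since cut() and endpoints() split the series on the same hour boundaries:

# the group means should agree between the two approaches
all.equal(a$power, h$power)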
answered Apr 09 '23 by Joshua Ulrich