Faster way to count occurrences in 5 minute segments?

Tags:

r

I have a matrix, events, that contains the times of 5 million events. Each event has a "type" that ranges from 1 to 2000. A very simplified version of the matrix is shown below. The units for "times" are seconds since 1970. All of the events have occurred since 1/1/2012.

>events
      type          times
      1           1352861760
      1           1362377700
      2           1365491820
      2           1368216180
      2           1362088800
      2           1362377700

I am trying to divide the time since 1/1/2012 into 5-minute buckets and then count, for each type i, how many events of that type occurred in each bucket. My code is below. Note that types is a vector containing each possible type from 1 to 2000, and by is set to 300 because that is how many seconds are in 5 minutes.

for(i in 1:length(types)){
    local <- events[events$type==types[i],c("type", "times")]
    assign(sprintf("a%d", i),table(cut(local$times, breaks=seq(range(events$times)[1],range(events$times)[2], by=300))))
}

This results in variables a1 through a2000, each of which contains a vector of how many occurrences of type i there were in each of the 5-minute buckets.

I then compute all pairwise correlations among a1 through a2000.

Is there a way to optimize the chunk of code I provided above? It runs very slowly, and I can't think of a way to make it faster. Perhaps there are just too many buckets and too little time.

Any insight would be much appreciated.

Reproducible example:

>head(events)
     type         times
      12           1308575460
      12           1308676680
      12           1308825420
      12           1309152660
      12           1309879140
      25           1309946460

xevents <- xts(events[,"type"],.POSIXct(events[,"times"]))
ep <- endpoints(xevents, "minutes", 5)
counts <- period.apply(xevents, ep, tabulate, nbins=length(types))

>head(counts)
                       1    2    3    4    5   6    7    8    9   10   11  12   13   14
2011-06-20 09:11:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-06-21 13:18:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-06-23 06:37:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-06-27 01:31:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-07-05 11:19:00    0    0    0    0    0   0    0    0    0    0    0   1    0   0
2011-07-06 06:01:00    0    0    0    0    0   0    0    0    0    0    0   0    0   0

>ep[1:20]
[1]  0  1  2  3  4  5  6  7  8  9 10 12 20 21 22 23 24 25 26 27

Above is the code I have been using, but the problem is that the output does not advance in regular 5-minute steps: it only contains rows for the times at which events actually occurred.

user2588829 asked Mar 24 '23 01:03


1 Answer

I would use the xts package for this. Running a function over non-overlapping 5-minute intervals is easy with the period.apply and endpoints functions.

# create sample data
library(xts)
set.seed(21)
N <- 1e6
events <- cbind(sample(2000, N, replace=TRUE),
  as.POSIXct("2012-01-01")+sample(1e7,N))
colnames(events) <- c("type","times")
# create xts object
xevents <- xts(events[,"type"], .POSIXct(events[,"times"]))
# find the last row of each non-overlapping 5-minute interval
ep <- endpoints(xevents, "minutes", 5)
# count the number of occurrences of each "type"
counts <- period.apply(xevents, ep, tabulate, nbins=2000)
# set colnames
colnames(counts) <- paste0("a",1:ncol(counts))
# calculate correlation
#cc <- cor(counts)
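
To see why ep follows the events rather than the clock, here is a minimal sketch on a toy series. The timestamps and values are made up for illustration. It relies on the documented behavior of endpoints(): it returns 0 followed by the row index of the last observation in each period that contains data, so a 5-minute period with no events produces no entry at all.

```r
library(xts)

# Hypothetical toy series: five events, none between 00:10:00 and 00:15:00.
tt <- as.POSIXct("2012-01-01", tz = "UTC") + c(10, 60, 299, 300, 1020)
x  <- xts(c(1, 2, 1, 3, 2), order.by = tt)

# 0, then the index of the last observation in each non-empty 5-minute period.
ep <- endpoints(x, on = "minutes", k = 5)

# One period per consecutive pair of endpoints; the empty period is absent.
n_periods <- length(ep) - 1

# Same counting step as above, with only 3 possible types in this toy data.
counts <- period.apply(x, ep, tabulate, nbins = 3)
print(ep)
print(counts)
```

The clock spans four 5-minute intervals here, but only three of them contain events, so only three rows should come back. Getting a row for every interval requires adding the empty intervals explicitly.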

Update to respond to OP's comments/edits:

# Create a sequence of 5-minute steps. Choose ONE of the two starting points
# below; if both lines run, the second assignment overwrites the first.
# Option 1: start from the first event, rounded to the nearest minute
m5 <- seq(round(start(xevents),'mins'), end(xevents), by='5 mins')
# Option 2: start from 2012-01-01
m5 <- seq(as.POSIXct("2012-01-01"), end(xevents), by='5 mins')
# merge xevents with empty 5-minute xts object, and
# subtract 1 second, so endpoints are at end of each 5-minute interval
xevents5 <- merge(xevents, xts(,m5-1))
ep5 <- endpoints(xevents5, "minutes", 5)
counts5 <- period.apply(xevents5, ep5, tabulate, nbins=2000)
colnames(counts5) <- paste0("a",1:ncol(counts5))
# align to the beginning of each 5-minute interval, if you want
counts5 <- align.time(counts5,60*5)
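
As a cross-check that does not depend on xts, the same bucket-by-type counts can be sketched in base R with integer division and a single tabulate() call. This is an alternative technique, not the xts approach above. The toy data, the origin of 2012-01-01 00:00:00 UTC (1325376000 seconds since 1970), and the variable names are assumptions for illustration.

```r
# Hypothetical toy data: types 1..3, times in seconds since 1970.
events <- data.frame(
  type  = c(1, 1, 2, 2, 3, 1),
  times = c(1325376010, 1325376200, 1325376299,
            1325376300, 1325377000, 1325377199)
)

origin  <- 1325376000   # assumed start: 2012-01-01 00:00:00 UTC
width   <- 300          # 5 minutes in seconds
n_types <- 3            # would be 2000 for the real data

# 1-based bucket number for every event; empty buckets are simply never hit.
bucket    <- (events$times - origin) %/% width + 1
n_buckets <- max(bucket)

# Map each (bucket, type) pair to one cell of a column-major matrix,
# then count all cells in a single pass.
cell   <- (events$type - 1) * n_buckets + bucket
counts <- matrix(tabulate(cell, nbins = n_buckets * n_types),
                 nrow = n_buckets, ncol = n_types)
colnames(counts) <- paste0("a", seq_len(n_types))
print(counts)
```

Every 5-minute bucket gets a row, including ones with no events. At full scale the dense matrix has one integer per bucket and type, which can reach hundreds of millions of cells, so memory may become the limiting factor.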
Joshua Ulrich answered Apr 25 '23 14:04