
Time period buckets

Tags: r, time-series

I have event logs covering several months. Each record contains a start time, an end time, a category id and a count.

I'd like to aggregate them over time so that I can plot histograms for a given day, week or month. I assume the best way to do this is to bin the periods into fixed-size buckets, and I think 5 minutes would be a good size.

For example, if an event starts at 1:01 pm and ends at 1:07 pm, I'd like to get 2 records for it, since it covers two 5-minute periods (1:00-1:05 and 1:05-1:10), with the rest of the original data (category and count) copied into each new record.
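To make the idea concrete, here is a minimal sketch for a single event. The helper name `bucket_starts`, the `width_secs` parameter and the use of UTC are assumptions made for illustration, not part of the original post. The end time is treated as exclusive, so an event ending exactly on a boundary does not occupy the bucket that starts there.

```r
# Hypothetical helper: start times of every fixed-width bucket an event touches.
bucket_starts <- function(start, end, width_secs = 300) {
  # Index of the bucket containing the start (seconds since epoch / width).
  first <- floor(as.numeric(start) / width_secs)
  # Index of the last bucket touched; the end is exclusive, and the result
  # is never allowed to fall before the first bucket.
  last <- max(ceiling(as.numeric(end) / width_secs) - 1, first)
  as.POSIXct(seq(first, last) * width_secs, origin = "1970-01-01", tz = "UTC")
}

s <- as.POSIXct("2012-11-17 13:01:00", tz = "UTC")
e <- as.POSIXct("2012-11-17 13:07:00", tz = "UTC")
b <- bucket_starts(s, e)
format(b, "%H:%M")  # "13:00" "13:05"
```

Working on epoch seconds lines up with wall-clock 5-minute boundaries as long as the time zone offset is a multiple of the bucket width.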

If my input logs (x) look like this:

start, end, catid, count
2012-11-17 15:05:02.0, 2012-11-17 15:12:52.0, 1, 2
2012-11-17 15:07:13.0, 2012-11-17 15:17:47.0, 2, 10
2012-11-17 15:11:00.0, 2012-11-17 15:12:33.0, 3, 5
2012-11-17 15:12:01.0, 2012-11-17 15:20:00.0, 4, 1

I'm trying to get the output (b) bucketed into 5-minute slots, like this:

start, catid, count
2012-11-17 15:05:00.0, 1, 2
2012-11-17 15:10:00.0, 1, 2

2012-11-17 15:05:00.0, 2, 10
2012-11-17 15:10:00.0, 2, 10
2012-11-17 15:15:00.0, 2, 10

2012-11-17 15:10:00.0, 3, 5

2012-11-17 15:10:00.0, 4, 1
2012-11-17 15:15:00.0, 4, 1

Then I can easily aggregate the new data frame (b) by category id over whatever period I want (hour, day, week, month).
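As a rough sketch of that aggregation step, the bucketed rows from the example can be summed per category and hour with base R's `aggregate()`. The `hour` column and the choice of `sum` are assumptions for illustration. Whether sum, mean or max is the right summary depends on what `count` represents, because each event's count is repeated once per bucket.

```r
# b: the bucketed example data (tz = "UTC" is an assumption for reproducibility).
b <- data.frame(
  start = as.POSIXct(c("2012-11-17 15:05:00", "2012-11-17 15:10:00",
                       "2012-11-17 15:05:00", "2012-11-17 15:10:00",
                       "2012-11-17 15:15:00", "2012-11-17 15:10:00",
                       "2012-11-17 15:10:00", "2012-11-17 15:15:00"), tz = "UTC"),
  catid = c(1, 1, 2, 2, 2, 3, 4, 4),
  count = c(2, 2, 10, 10, 10, 5, 1, 1)
)

# Label each bucket with the hour it falls in, then sum counts per category and hour.
b$hour <- format(b$start, "%Y-%m-%d %H:00")
hourly <- aggregate(count ~ catid + hour, data = b, FUN = sum)
hourly
```

Swapping the format string (for example `"%Y-%m-%d"` for days or `"%Y-%m"` for months) changes the period without touching the rest.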

I'm just starting with R. I found a lot of explanations of how to bucket a single time value, but not a period of time. I've had a look at zoo and xts, but I couldn't quite work out what to do.

Hopefully that makes sense to some of you.

Edit:

I've slightly modified Ram's suggestion so that the number of blocks is calculated from the rounded end time rather than the original end time. (Thanks Ram!)

mnslot <- 15  # size of the buckets/slots in minutes

# Assumes st, en, df and unix2POSIXct() are defined as in Ram's answer.

# Round the minutes of the start time down to a multiple of mnslot
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
roundedmins <- floor(min_st / mnslot) * mnslot
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)

# Round the minutes of the end time down to a multiple of mnslot
en.str <- strptime(en, "%Y-%m-%d %H:%M:%S")
min_en <- as.numeric(format(en.str, "%M"))
roundedmins <- floor(min_en / mnslot) * mnslot
en.base <- strptime(en, "%Y-%m-%d %H")
rounded_end <- en.base + (roundedmins * 60)

# Number of blocks, based on the rounded start and end times.
# Subtracting date-times gives a difftime whose units are chosen automatically,
# so ask for minutes explicitly rather than guessing the divisor.
# Note: an event ending exactly on a slot boundary is counted in the slot starting there.
numblocks <- round(as.numeric(difftime(rounded_end, df$rounded_start, units = "mins")) / mnslot) + 1

# Create replicated rows, one per slot covered by each event
replicated_cat <- NULL
replicated_count <- NULL
replicated_start <- NULL
for (n in seq_along(numblocks)) {
  for (newrow in seq_len(numblocks[n])) {
    # Step by the slot size in seconds (not a hard-coded 300)
    replicated_start <- c(replicated_start, df$rounded_start[n] + (newrow - 1) * mnslot * 60)
    replicated_cat <- c(replicated_cat, df$catid[n])
    replicated_count <- c(replicated_count, df$count[n])
  }
}

# Change to a readable format
POSIXT <- unix2POSIXct(replicated_start)

newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf

This produces the output I need, though it is a bit slow.
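One likely reason for the slowness is that the nested loop grows three vectors one element at a time. A vectorised sketch using `rep()` and `sequence()` is shown below. It is an alternative, not the code from the post: the data frame `x`, the names `slot`, `first`, `last`, `idx` and `offset`, and the use of UTC are all assumptions, and the end time is treated as exclusive so that the sample data gives the 8 rows listed in the question.

```r
# Sample rows from the question (tz = "UTC" is an assumption for reproducibility).
x <- data.frame(
  start = as.POSIXct(c("2012-11-17 15:05:02", "2012-11-17 15:07:13",
                       "2012-11-17 15:11:00", "2012-11-17 15:12:01"), tz = "UTC"),
  end   = as.POSIXct(c("2012-11-17 15:12:52", "2012-11-17 15:17:47",
                       "2012-11-17 15:12:33", "2012-11-17 15:20:00"), tz = "UTC"),
  catid = 1:4,
  count = c(2, 10, 5, 1)
)

slot <- 5 * 60  # slot width in seconds

# Index of the first and last slot each event touches. The end is exclusive,
# so an event ending exactly at 15:20:00 does not occupy the 15:20 slot.
first <- floor(as.numeric(x$start) / slot)
last  <- pmax(ceiling(as.numeric(x$end) / slot) - 1, first)
n     <- last - first + 1

# Repeat each row index n times, with a running offset 0, 1, 2, ... per event.
idx    <- rep(seq_len(nrow(x)), times = n)
offset <- sequence(n) - 1

b <- data.frame(
  start = as.POSIXct((first[idx] + offset) * slot, origin = "1970-01-01", tz = "UTC"),
  catid = x$catid[idx],
  count = x$count[idx]
)
b
```

Changing `slot` to `15 * 60` gives 15-minute slots with no other changes.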

bill ou asked Mar 26 '26 03:03


1 Answer

Here's a fully working version. It builds the result you are after through step-by-step data manipulation.

# The original data, stored as a CSV file
df <- read.csv("tsdata.csv")
st <- as.POSIXlt(df$start)
en <- as.POSIXlt(df$end)

# A utility function to turn seconds since the epoch back into POSIXct
unix2POSIXct <- function(time) structure(time, class = c("POSIXt", "POSIXct"))

# For each row, determine how many replications are needed.
# Ask difftime() for minutes explicitly; en - st picks its units automatically.
# This counts by duration, so a short event that crosses a 5-minute boundary still gets one row.
numdups <- floor(as.numeric(difftime(en, st, units = "mins")) / 5) + 1

st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))

# Round the minutes of the start time down to a 5-minute boundary: 0, 5, 10, etc.
roundedmins <- floor(min_st / 5) * 5
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)

# Create replicated rows, depending on the size of the interval
replicated_cat <- NULL
replicated_count <- NULL
replicated_start <- NULL
for (n in seq_along(numdups)) {
  for (newrow in seq_len(numdups[n])) {
    # Each replicated row starts 300 seconds (5 minutes) after the previous one
    replicated_start <- c(replicated_start, df$rounded_start[n] + (newrow - 1) * 300)
    replicated_cat <- c(replicated_cat, df$catid[n])
    replicated_count <- c(replicated_count, df$count[n])
  }
}

# Change to a readable format
POSIXT <- unix2POSIXct(replicated_start)

newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf

Which produces:

                start CatId Count
1 2012-11-17 15:05:00     1     2
2 2012-11-17 15:10:00     1     2
3 2012-11-17 15:05:00     2    10
4 2012-11-17 15:10:00     2    10
5 2012-11-17 15:15:00     2    10
6 2012-11-17 15:10:00     3     5
7 2012-11-17 15:10:00     4     1
8 2012-11-17 15:15:00     4     1
Ram Narasimhan answered Mar 27 '26 15:03