I've got logs of events that contain: start time, end time, category id and count. They cover several months.
I'd like to aggregate them over time to be able to trace histograms over a given day, week, month. So I assume the best way to do this is to bin the periods in buckets. I think 5 minutes would be good.
e.g. If an event starts at 1.01pm and ends at 1.07pm, I'd like to obtain 2 records for it as it covers 2 periods of 5 minutes (0-5 and 5-10) and replicate the rest of the original data for these new records (category and count)
if my input logs (x) are as such:
start / end / catid / count
2012-11-17 15:05:02.0, 2012-11-17 15:12:52.0, 1, 2
2012-11-17 15:07:13.0, 2012-11-17 15:17:47.0, 2, 10
2012-11-17 15:11:00.0, 2012-11-17 15:12:33.0, 3, 5
2012-11-17 15:12:01.0, 2012-11-17 15:20:00.0, 4, 1
I'm trying to get the output bucketed in 5 minutes (b) this way:
start / catid / count
2012-11-17 15:05:00.0 1, 2
2012-11-17 15:10:00.0 1, 2
2012-11-17 15:05:00.0 2, 10
2012-11-17 15:10:00.0 2, 10
2012-11-17 15:15:00.0 2, 10
2012-11-17 15:10:00.0 3, 5
2012-11-17 15:10:00.0 4, 1
2012-11-17 15:15:00.0 4, 1
Then I can easily aggregate the new data frame (b) over category ids for the period I want (hour, day, week, month)
I'm starting with R and I found a lot explanations on how to bucket a time value but not a period of time. I've had a look at zoo and xts but I couldn't quite find what to do.
Hopefully that makes sense to some of you.
Edit:
I've slightly modified Ram's suggestion to get the correct calculation of blocks using the rounded endtime rather than the original end time. (Thanks Ram!)
mnslot=15 # size of the buckets/slot in minutes
#Round down the minutes of starttime to a mutliple of mnslot
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
roundedmins <- floor(min_st/mnslot) * mnslot
st.base <- strptime(st, "%Y-%m-%d %H")
rounded_start <- st.base + (roundedmins * 60)
#Round down the minutes of the endtime to a multiple of mnslot.
en.str <- strptime(en, "%Y-%m-%d %H:%M:%S")
min_en <- as.numeric(format(en.str, "%M"))
roundedmins <- floor(min_en/mnslot) * mnslot
en.base <- strptime(en, "%Y-%m-%d %H")
rounded_end<- en.base + (roundedmins * 60)
# calculate the number of blocks based on the rounded minutes of start and end
numblocks<- as.numeric(floor((rounded_end-rounded_start)/mnslot/60)+1)
# differenced of POSIXct values is in minutes
# but difference of POSIXlt seems to be in seconds , so have to divide by 60 as well
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numblocks)){
for (newrow in 1:numblocks[n]){
replicated_start = c(replicated_start, df$rounded_start[n]+(newrow-1)*300 )
replicated_cat = c(replicated_cat, df$catid[n])
replicated_count = c(replicated_count, df$count[n])
}
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf
This produces the required output. it is a bit slow though:p
Here's a fully working version. It involves step-by-step data manipulation for what you are after.
#storing the original data as a csv
df <- read.csv("tsdata.csv")
st<-as.POSIXlt(df$start)
en<-as.POSIXlt(df$end)
#a utility function to convert formats
unix2POSIXct <- function (time) structure(time, class = c("POSIXt", "POSIXct") )
#For each row, determine how many replications are needed
numdups <- as.numeric(floor((en-st)/5)+1)
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
#Round down the minutes of start to 5 minute starts. 0,5,10 etc...
roundedmins <- floor(min_st/5) * 5
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numdups)){
for (newrow in 1:numdups[n]){
replicated_start = c(replicated_start, df$rounded_start[n]+(newrow-1)*300 )
replicated_cat = c(replicated_cat, df$catid[n])
replicated_count = c(replicated_count, df$count[n])
}
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf
Which produces:
start CatId Count
1 2012-11-17 15:05:00 1 2
2 2012-11-17 15:10:00 1 2
3 2012-11-17 15:05:00 2 10
4 2012-11-17 15:10:00 2 10
5 2012-11-17 15:15:00 2 10
6 2012-11-17 15:10:00 3 5
7 2012-11-17 15:10:00 4 1
8 2012-11-17 15:15:00 4 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With