From a data frame with timestamped rows (strptime results), what is the best method for aggregating statistics for intervals?
Intervals could be an hour, a day, etc.
There's the aggregate function, but that doesn't help with assigning each row to an interval. I'm planning on adding a column to the data frame that denotes interval and using that with aggregate, but if there's a better solution it'd be great to hear it.
Thanks for any pointers!
Example Data
Five rows with timestamps divided into 15-minute intervals starting at 03:00.
Interval 1
Interval 2
Conclusion
Using a time series package such as xts should be the solution; however, I had no success with them and wound up using cut. As I presently only need to plot histograms, with rows grouped by interval, this was enough.
cut is used like so:
interv <- function(x, start, period, num.intervals) {
return(cut(x, as.POSIXlt(start)+0:num.intervals*period))
}
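For anyone who wants the aggregation step as well, here is a minimal sketch of the "add an interval column, then aggregate" idea from the question. The data frame df and its value column are made up for illustration, and the time zone is fixed to UTC only to keep the example reproducible.

```r
# Hypothetical example data: five timestamps plus a made-up numeric column.
df <- data.frame(
  time  = as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                       "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                       "2010-01-13 03:22:19"), tz = "UTC"),
  value = c(10, 20, 30, 40, 50)
)

# Breaks every 15 minutes (900 s) from 03:00; two intervals need three breaks.
start  <- as.POSIXct("2010-01-13 03:00:00", tz = "UTC")
breaks <- start + 0:2 * 900

# cut() returns a factor labelling each row with the start of its interval.
df$interval <- cut(df$time, breaks)

# Aggregate the value column per interval (mean here; any summary function works).
result <- aggregate(value ~ interval, data = df, FUN = mean)
print(result)
```

Rows that fall outside the breaks get NA as their interval and are dropped by aggregate, so the breaks should cover the whole time range.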
Standard functions to split vectors are cut and findInterval:
v <- as.POSIXct(c(
"2010-01-13 03:02:38 UTC",
"2010-01-13 03:08:14 UTC",
"2010-01-13 03:14:52 UTC",
"2010-01-13 03:20:42 UTC",
"2010-01-13 03:22:19 UTC"
))
# Your function returns a list:
interv(v, as.POSIXlt("2010-01-13 03:00:00 UTC"), 900)
# [[1]]
# [1] "2010-01-13 03:00:00"
# [[2]]
# [1] "2010-01-13 03:00:00"
# [[3]]
# [1] "2010-01-13 03:00:00"
# [[4]]
# [1] "2010-01-13 03:15:00 CET"
# [[5]]
# [1] "2010-01-13 03:15:00 CET"
# cut returns a factor; you must provide proper breaks:
cut(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 2010-01-13 03:00:00 2010-01-13 03:00:00 2010-01-13 03:00:00
# [4] 2010-01-13 03:15:00 2010-01-13 03:15:00
# Levels: 2010-01-13 03:00:00 2010-01-13 03:15:00
# findInterval returns a vector of interval ids (same breaks as in cut):
findInterval(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 1 1 1 2 2
For the record: cut has a method for the POSIXt type, but unfortunately there is no way to provide a start argument. The effect is:
cut(v,"15 min")
# [1] 2010-01-13 03:02:00 2010-01-13 03:02:00 2010-01-13 03:02:00
# [4] 2010-01-13 03:17:00 2010-01-13 03:17:00
# Levels: 2010-01-13 03:02:00 2010-01-13 03:17:00
As you can see, the first interval starts at 03:02:00. You could adjust the labels of the output factor (convert the labels to times, round them somehow, and convert back to character).
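An alternative to fixing the labels afterwards is to build aligned breaks yourself and pass them to cut. The sketch below rounds the earliest timestamp down and the latest up to a multiple of the period, working in seconds since the epoch. It assumes UTC; in a time zone whose offset is not a multiple of the period, the boundaries would not fall on round local times.

```r
v <- as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                  "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                  "2010-01-13 03:22:19"), tz = "UTC")
period <- 900  # 15 minutes, in seconds

# Round the earliest timestamp down to a multiple of the period ...
lo <- floor(as.numeric(min(v)) / period) * period
# ... and go one full period past the floored latest timestamp, so the
# last value lies strictly inside the final interval.
hi <- (floor(as.numeric(max(v)) / period) + 1) * period

breaks  <- as.POSIXct(seq(lo, hi, by = period), origin = "1970-01-01", tz = "UTC")
aligned <- cut(v, breaks)
print(aligned)
```

Here the breaks come out as 03:00, 03:15 and 03:30, so the intervals line up with the quarter hours instead of with the first observation.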
Use a time series package. The xts package has functions designed specifically to do that. Or look at the aggregate and rollapply functions in the zoo package.
The rmetrics ebook has a useful discussion, including a performance comparison of the various packages: https://www.rmetrics.org/files/freepdf/TimeSeriesFAQ.pdf
Edit: Look at my answer to this question. Basically you need to truncate every timestamp into a specific interval and then do the aggregation using those new truncated timestamps as your grouping vector.
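A rough base-R sketch of that truncate-then-aggregate idea follows; it is not the code from the linked answer. The data frame df, its value column, and the 10-minute period are assumptions for illustration.

```r
# Hypothetical example data: five timestamps plus a made-up numeric column.
df <- data.frame(
  time  = as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                       "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                       "2010-01-13 03:22:19"), tz = "UTC"),
  value = c(10, 20, 30, 40, 50)
)

period <- 600  # interval length in seconds (10 minutes here; 3600 for hourly)

# Truncate every timestamp to the start of its interval by flooring the
# number of seconds since the epoch to a multiple of the period.
df$bucket <- as.POSIXct(floor(as.numeric(df$time) / period) * period,
                        origin = "1970-01-01", tz = "UTC")

# Use the truncated timestamps as the grouping vector.
stats <- aggregate(value ~ bucket, data = df, FUN = mean)
print(stats)
```

Because the grouping column is computed arithmetically, no explicit vector of breaks is needed, and intervals with no rows simply do not appear in the result.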