Aggregate 1-minute data into 5-minute average data

Tags: date, r

I want to aggregate data collected every minute into 5-minute averages.

DeviceTime         Concentration
6/20/2013 11:13       
6/20/2013 11:14
6/20/2013 11:15
6/20/2013 11:16
6/20/2013 11:17
6/20/2013 11:18
6/20/2013 11:19
6/20/2013 11:20
6/20/2013 11:21
6/20/2013 11:22
6/20/2013 11:23
6/20/2013 11:24
6/20/2013 11:25
6/20/2013 11:26
6/20/2013 11:27
6/20/2013 11:28

...

The result I want is like:

DeviceTime             Concentration
6/20/2013 11:15
6/20/2013 11:20
6/20/2013 11:25
6/20/2013 11:30
6/20/2013 11:35
...

Each 5-minute average is simply the mean of the concentration values over the preceding five minutes.

Vicki1227 asked Mar 05 '14 16:03


3 Answers

Using the dplyr package, and assuming your data is stored in a data frame named df with DeviceTime as a date-time (POSIXct) column:

library(dplyr)

# Bin the timestamps into 5-minute intervals, then average within each bin
df %>%
  group_by(DeviceTime = cut(DeviceTime, breaks = "5 min")) %>%
  summarize(Concentration = mean(Concentration))

lukeA answered Nov 16 '22 06:11


If your data doesn't start on a nice 5-minute wall clock boundary (as shown in your sample data – 11:13), note that cut() will create breakpoints based on the first timestamp it finds. This probably isn't what we normally want. Indeed, your sample output indicates this is not what you want.

Here's what cut() does:

df <- read.table(header=TRUE, sep=",", stringsAsFactors=FALSE, text="
DeviceTime,Concentration
6/20/2013 11:13,1
6/20/2013 11:14,1
6/20/2013 11:15,2
6/20/2013 11:16,2
6/20/2013 11:17,2
6/20/2013 11:18,2
6/20/2013 11:19,2
6/20/2013 11:20,3
6/20/2013 11:21,3
6/20/2013 11:22,3
6/20/2013 11:23,3
6/20/2013 11:24,3
6/20/2013 11:25,4")
df$DeviceTime <- as.POSIXct(df$DeviceTime, format="%m/%d/%Y %H:%M")

cut(df$DeviceTime, breaks="5 min")
 [1] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:13:00
 [4] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:18:00
 [7] 2013-06-20 11:18:00 2013-06-20 11:18:00 2013-06-20 11:18:00
[10] 2013-06-20 11:18:00 2013-06-20 11:23:00 2013-06-20 11:23:00
[13] 2013-06-20 11:23:00

means <- aggregate(df["Concentration"], 
                   list(fiveMin=cut(df$DeviceTime, "5 mins")),
                   mean)
means
              fiveMin Concentration
1 2013-06-20 11:13:00      1.600000
2 2013-06-20 11:18:00      2.600000
3 2013-06-20 11:23:00      3.333333

Notice that the first row of means (the 11:13:00 entry) is the mean of the first 5 rows of df, which have times of 11:13 to 11:17 -- i.e., up until just before the next cut/break point of 11:18.

You'll get the same result with dplyr (i.e., @lukeA's answer) if you use cut():

df %>%
  group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %>%
  summarize(Concentration = mean(Concentration))
Source: local data frame [3 x 2]

           DeviceTime Concentration
1 2013-06-20 11:13:00      1.600000
2 2013-06-20 11:18:00      2.600000
3 2013-06-20 11:23:00      3.333333
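
If you want to stay with cut() but have the bins start on wall clock boundaries, one option is to pass an explicit vector of break points instead of "5 min". The following is a minimal sketch and not part of the original answers: the data frame is rebuilt so the block runs on its own, and tz = "UTC" and the hard-coded 11:10 and 11:30 limits are assumptions chosen to cover the sample data.

```r
# Sample data equivalent to the data frame above; tz = "UTC" is an assumption
# made so the result does not depend on the local time zone.
times <- seq(as.POSIXct("2013-06-20 11:13", tz = "UTC"),
             by = "1 min", length.out = 13)
df <- data.frame(DeviceTime = times,
                 Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4))

# Explicit break points on 5-minute wall clock boundaries.
# The start and end values are hard-coded here to span the sample data.
breaks <- seq(as.POSIXct("2013-06-20 11:10", tz = "UTC"),
              as.POSIXct("2013-06-20 11:30", tz = "UTC"),
              by = "5 min")

# cut() on date-times uses left-closed intervals by default,
# so 11:15 falls into the bin that starts at 11:15.
meansAligned <- aggregate(df["Concentration"],
                          list(fiveMin = cut(df$DeviceTime, breaks = breaks)),
                          mean)
meansAligned
```

With these breaks the rows are grouped as 11:10-11:14, 11:15-11:19, 11:20-11:24 and 11:25-11:29, rather than starting from the first timestamp.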

The xts package seems to break by wall clock time:

require(xts)
df.xts <- xts(df$Concentration, df$DeviceTime)
means.xts <- period.apply(df.xts, endpoints(df.xts, "mins", k=5), mean)
means.xts
                    [,1]
2013-06-20 11:14:00    1
2013-06-20 11:19:00    2
2013-06-20 11:24:00    3
2013-06-20 11:25:00    4

The time values are always the last time entry found in the 5-min window. You can round the time index up to the next 5-min boundary with align.time() if you want to report the times of the end of the periods:

means.rounded <- align.time(means.xts, 5*60)
means.rounded
                    [,1]
2013-06-20 11:15:00    1
2013-06-20 11:20:00    2
2013-06-20 11:25:00    3
2013-06-20 11:30:00    4

You can also round down, if you want to report the times of the beginning of the periods. But you'll need to define your own function first (which I found on Cross Validated):

align.time.down = function(x,n) {
    index(x) = index(x) - n
    align.time(x,n)
}
means.rounded.down <- align.time.down(means.xts, 5*60)
means.rounded.down
                    [,1]
2013-06-20 11:10:00    1
2013-06-20 11:15:00    2
2013-06-20 11:20:00    3
2013-06-20 11:25:00    4

Another solution, which uses floor() rather than the xts package, is as follows:

df$DeviceTimeFloor <- as.POSIXct(floor(as.numeric(df$DeviceTime) / (5 * 60)) * (5 * 60), origin='1970-01-01')
meansFloor <- aggregate(Concentration ~ DeviceTimeFloor, df, mean)
meansFloor
      DeviceTimeFloor Concentration
1 2013-06-20 11:10:00             1
2 2013-06-20 11:15:00             2
3 2013-06-20 11:20:00             3
4 2013-06-20 11:25:00             4

I prefer to report the start time of the 5-minute interval, and floor() is good for this. If I were to report aggregates by hour, I would expect a timestamp of 2013-06-20 11:00:00 to contain data for the period 11:00:00 - 11:59:59, not 10:00:00 - 10:59:59.

If you prefer to report the end time of the intervals, ceiling() can be used instead of floor(). But note that timestamps 11:01 - 11:05 will be converted to (and hence grouped at) 11:05 by ceiling(). In contrast, floor() converts 11:00 - 11:04 to 11:00.

So floor() and ceiling() each group a different set of observations. The xts package will group the same set of observations as floor(), but it will report the timestamp of the last observation in the period.
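
The ceiling() variant described above can be sketched as follows. This is only an illustration under assumptions: the data frame is rebuilt so the block runs on its own, tz = "UTC" is chosen to keep the result independent of the local time zone, and the names width, DeviceTimeCeil and meansCeil are made up for this example.

```r
# Sample data equivalent to the data frame above (tz = "UTC" is an assumption).
times <- seq(as.POSIXct("2013-06-20 11:13", tz = "UTC"),
             by = "1 min", length.out = 13)
df <- data.frame(DeviceTime = times,
                 Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4))

width <- 5 * 60  # interval length in seconds

# Round each timestamp up to the end of its 5-minute interval.
# A timestamp that is already on a boundary (e.g. 11:15) stays where it is.
df$DeviceTimeCeil <- as.POSIXct(ceiling(as.numeric(df$DeviceTime) / width) * width,
                                origin = "1970-01-01", tz = "UTC")

meansCeil <- aggregate(Concentration ~ DeviceTimeCeil, df, mean)
meansCeil
```

Here 11:13 - 11:15 are grouped at 11:15, 11:16 - 11:20 at 11:20, and 11:21 - 11:25 at 11:25, so each reported time covers the observations of the preceding five minutes.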

Mark Rajcok answered Nov 16 '22 05:11


I'd say the easiest and cleanest way to do this is using the lubridate and dplyr packages.

library(lubridate)  # for working with dates
library(dplyr)      # for manipulating data

# Round each timestamp down to the start of its 5-minute interval
df$DeviceTime5min <- floor_date(df$DeviceTime, "5 mins")
df_5min <- df %>%
  group_by(DeviceTime5min) %>%
  summarize(Concentration = mean(Concentration))

The only limitation is that it works just for interval lengths that fit evenly into an hour, that is: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, or 60 minutes. For these it works perfectly :-)
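
For interval lengths that do not fit evenly into an hour, one possible workaround is the epoch-seconds arithmetic from the floor() answer above, written in base R. This is a rough sketch, not something the answer itself proposes: floor_to_minutes is a hypothetical helper (it is not a lubridate or dplyr function), tz = "UTC" is an assumption, and the bins are counted from the Unix epoch, so they will not generally line up with the top of the hour.

```r
# Hypothetical helper: floor timestamps to bins of an arbitrary width in
# minutes, counted from 1970-01-01 00:00:00 UTC rather than from the hour.
floor_to_minutes <- function(x, minutes, tz = "UTC") {
  width <- minutes * 60  # bin width in seconds
  as.POSIXct(floor(as.numeric(x) / width) * width,
             origin = "1970-01-01", tz = tz)
}

# Illustrative data (tz = "UTC" is an assumption to keep the example reproducible).
times <- seq(as.POSIXct("2013-06-20 11:13", tz = "UTC"),
             by = "1 min", length.out = 13)
df <- data.frame(DeviceTime = times,
                 Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4))

# Group into 7-minute bins and average within each bin.
df$DeviceTime7min <- floor_to_minutes(df$DeviceTime, 7)
means7 <- aggregate(Concentration ~ DeviceTime7min, df, mean)
means7
```

Whether epoch-anchored bins are acceptable depends on how the averages will be reported, so treat this as a starting point rather than a drop-in replacement.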

Marek Lahoda answered Nov 16 '22 05:11