Calculate average monthly total by groups from data.table in R

Question

I have a data.table with a row for each day over a 30 year period with a number of different variable columns. The reason for using data.table is that the .csv file I'm using is huge (approx 1.2 million rows) as there are 30 years worth of data for a number of groups charactertised by a column called 'key'.

An example dataset is shown below:

Key   Date          Runoff
A     1980-01-01    2
A     1980-01-02    1
A     1981-01-01    0.1
A     1981-01-02    3
A     1982-01-01    2
A     1982-01-02    5
B     1980-01-01    1.5
B     1980-01-02    0.5
B     1981-01-01    0.3
B     1981-01-02    2
B     1982-01-01    1.5
B     1982-01-02    4

The above is a sample of two 'keys', with some data for January over three years to show what I mean. The actual dataset has hundreds of 'keys' and 30 years worth of data for each 'key'.

What I want to do is produce an output that has the total average for each month for each key as is shown below:

Key   January  February  March.... etc
A     4.36     ...       ...
B     3.26     ...       ...

i.e. the total average for January for Key A = (2 + 1) + (0.1 + 3) + (2 + 5) / 3

When I have done this analysis on one thirty year dataset (i.e. just one key) I have used the following code successfully to achieve this:

runoff_tot_average <- rowsum(DF$Runoff, format(DF$Date, '%m')) / 30

Where DF is the dataframe for one 30 year dataset.

So could I please have suggestions on how to modify my code above to work with the larger dataset with many 'keys' or offer a completely new solution!

EDIT

The below code produces the above data example:

Key <- c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02", "1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Key, Date, Runoff)

David Arenburg · Accepted Answer

They only way I could think of doing it was in two steps. Probably not the best way, but here goes

DT[, c("YM", "Month") := list(substr(Date, 1, 7), substr(Date, 6, 7))]
DT[, Runoff2 := sum(Runoff), by = c("Key", "YM")]
DT[, mean(Runoff2), by = c("Key", "Month")]

##   Key Month       V1
## 1:   A    01 4.366667
## 2:   B    01 3.266667

Just to show another (very similar) way:

DT[, c("year", "month") := list(year(Date), month(Date))]
DT[, Runoff2 := sum(Runoff), by=list(Key, year, month)]
DT[, mean(Runoff2), by=list(Key, month)]

Note that you don't have to create new columns, as by supports expressions as well. That is, you can directly use them in by as follows:

DT[, Runoff2 := sum(Runoff), by=list(Key, year = year(Date), month = month(Date))]

But since you require to aggregate more than once, it's better (for speed) to store them as additional columns, as @David has shown here.

eddi · Answer

If you're not looking for complicated functions and just want the mean, then the following should suffice:

DT[, sum(Runoff) / length(unique(year(Date))), list(Key, month(Date))]
#   Key month       V1
#1:   A     1 4.366667
#2:   B     1 3.266667

Calculate average monthly total by groups from data.table in R

Tags:

r

data.table

Catchment_Jack

Video Answer

2 Answers

David Arenburg

eddi

Recent Activity

Donate For Us

Calculate average monthly total by groups from data.table in R

Tags:

r

data.table

Catchment_Jack

Video Answer

2 Answers

David Arenburg

eddi

Related questions

Recent Activity

Donate For Us