I have a data.table with a row for each day over a 30 year period with a number of different variable columns. The reason for using data.table is that the .csv file I'm using is huge (approx 1.2 million rows) as there are 30 years worth of data for a number of groups charactertised by a column called 'key'.
An example dataset is shown below:
Key Date Runoff
A 1980-01-01 2
A 1980-01-02 1
A 1981-01-01 0.1
A 1981-01-02 3
A 1982-01-01 2
A 1982-01-02 5
B 1980-01-01 1.5
B 1980-01-02 0.5
B 1981-01-01 0.3
B 1981-01-02 2
B 1982-01-01 1.5
B 1982-01-02 4
The above is a sample of two 'keys', with some data for January over three years to show what I mean. The actual dataset has hundreds of 'keys' and 30 years worth of data for each 'key'.
What I want to do is produce an output that has the total average for each month for each key as is shown below:
Key January February March.... etc
A 4.36 ... ...
B 3.26 ... ...
i.e. the total average for January for Key A = (2 + 1) + (0.1 + 3) + (2 + 5) / 3
When I have done this analysis on one thirty year dataset (i.e. just one key) I have used the following code successfully to achieve this:
runoff_tot_average <- rowsum(DF$Runoff, format(DF$Date, '%m')) / 30
Where DF is the dataframe for one 30 year dataset.
So could I please have suggestions on how to modify my code above to work with the larger dataset with many 'keys' or offer a completely new solution!
EDIT
The below code produces the above data example:
Key <- c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02", "1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Key, Date, Runoff)
They only way I could think of doing it was in two steps. Probably not the best way, but here goes
DT[, c("YM", "Month") := list(substr(Date, 1, 7), substr(Date, 6, 7))]
DT[, Runoff2 := sum(Runoff), by = c("Key", "YM")]
DT[, mean(Runoff2), by = c("Key", "Month")]
## Key Month V1
## 1: A 01 4.366667
## 2: B 01 3.266667
Just to show another (very similar) way:
DT[, c("year", "month") := list(year(Date), month(Date))]
DT[, Runoff2 := sum(Runoff), by=list(Key, year, month)]
DT[, mean(Runoff2), by=list(Key, month)]
Note that you don't have to create new columns, as by
supports expressions as well. That is, you can directly use them in by
as follows:
DT[, Runoff2 := sum(Runoff), by=list(Key, year = year(Date), month = month(Date))]
But since you require to aggregate more than once, it's better (for speed) to store them as additional columns, as @David has shown here.
If you're not looking for complicated functions and just want the mean, then the following should suffice:
DT[, sum(Runoff) / length(unique(year(Date))), list(Key, month(Date))]
# Key month V1
#1: A 1 4.366667
#2: B 1 3.266667
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With