Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summarize data.table by group

I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.

The dataset looks like this:

library(data.table)

# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp)

So basically, for location 1, I have measurements from january 2000 to december 2004 taken by model A. Then, I have measurements made by model B. And so on for models C, D and E. And then, so on for location 2 to location 10.

What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.

As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).

I tried this:

df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]

which didn't work as I expected. I would at least expect the resulting data table to be 1/5th the number of rows of the original table, since I am summarizing five measurements into a single one.

What am I doing wrong?

like image 325
thiagoveloso Avatar asked Apr 10 '16 05:04

thiagoveloso


People also ask

How do you summarize a data table?

Select the column to summarize on With a cell selected in an Add-In for Excel table, click the ACL Add-In tab and select Summarize > Summarize. Select a column of any data type to summarize on. Optional To omit the count or percentage for the unique values in the column, clear Include count or Include percentage.

How do I summarize a data table in R?

The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library. The following examples show how to use these functions in practice.

What does Summarise () do in R?

The summarize() function is used in the R program to summarize the data frame into just one value or vector. This summarization is done through grouping observations by using categorical values at first, using the groupby() function. The dplyr package is used to get the summary of the dataset.

How do I sum a group in R?

Group By Sum in R using dplyr You can use group_by() function along with the summarise() from dplyr package to find the group by sum in R DataFrame, group_by() returns the grouped_df ( A grouped Data Frame) and use summarise() on grouped df results to get the group by sum.


2 Answers

I don't think you generated your test data correctly. The function expand.grid() takes a cartesian product of all arguments. I'm not sure why you included the Temperature=temp argument in the expand.grid() call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2). I think you intended one temperature value per key, which should result in 10*60*5 rows:

df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
##       Location       Date Model Temperature
##    1:        1 2000-01-01     A    2.469751
##    2:        2 2000-01-01     A   16.103135
##    3:        3 2000-01-01     A    7.147051
##    4:        4 2000-01-01     A   10.301937
##    5:        5 2000-01-01     A   16.760238
##   ---
## 2996:        6 2004-12-01     E   26.293968
## 2997:        7 2004-12-01     E    8.446528
## 2998:        8 2004-12-01     E   29.003001
## 2999:        9 2004-12-01     E   12.076765
## 3000:       10 2004-12-01     E   28.410980

If this is correct, you can generate the means across models with this:

df[,.(Mean=mean(Temperature)),.(Location,Date)];
##      Location       Date      Mean
##   1:        1 2000-01-01  9.498497
##   2:        2 2000-01-01 11.744622
##   3:        3 2000-01-01 15.691228
##   4:        4 2000-01-01 11.457154
##   5:        5 2000-01-01  8.897931
##  ---
## 596:        6 2004-12-01 17.587000
## 597:        7 2004-12-01 19.555963
## 598:        8 2004-12-01 15.710465
## 599:        9 2004-12-01 15.322790
## 600:       10 2004-12-01 20.240392

Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.

In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.

Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j argument evaluations merged with their respective aggregation keys. Since our j argument results in a scalar value for each group, our result will be one row per Location/Date aggregation key.

like image 74
bgoldst Avatar answered Oct 21 '22 23:10

bgoldst


If we are using data.table, the CJ can be used

 CJ(Location=loc, date= dates,Model= mods)[, 
         Temperature:= temp][, .(Mean = mean(Temperature)), by = .(Location, date)]
like image 29
akrun Avatar answered Oct 21 '22 23:10

akrun