Summarize data.table by group

Tags:

I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.

The dataset looks like this:

library(data.table)

# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp)

So basically, for location 1, I have measurements from january 2000 to december 2004 taken by model A. Then, I have measurements made by model B. And so on for models C, D and E. And then, so on for location 2 to location 10.

What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.

As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).

I tried this:

df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]

which didn't work as I expected. I would at least expect the resulting data table to be 1/5th the number of rows of the original table, since I am summarizing five measurements into a single one.

What am I doing wrong?

325

asked Apr 10 '16 05:04

thiagoveloso

2 Answers

I don't think you generated your test data correctly. The function expand.grid() takes a cartesian product of all arguments. I'm not sure why you included the Temperature=temp argument in the expand.grid() call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2). I think you intended one temperature value per key, which should result in 10*60*5 rows:

df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
##       Location       Date Model Temperature
##    1:        1 2000-01-01     A    2.469751
##    2:        2 2000-01-01     A   16.103135
##    3:        3 2000-01-01     A    7.147051
##    4:        4 2000-01-01     A   10.301937
##    5:        5 2000-01-01     A   16.760238
##   ---
## 2996:        6 2004-12-01     E   26.293968
## 2997:        7 2004-12-01     E    8.446528
## 2998:        8 2004-12-01     E   29.003001
## 2999:        9 2004-12-01     E   12.076765
## 3000:       10 2004-12-01     E   28.410980

If this is correct, you can generate the means across models with this:

df[,.(Mean=mean(Temperature)),.(Location,Date)];
##      Location       Date      Mean
##   1:        1 2000-01-01  9.498497
##   2:        2 2000-01-01 11.744622
##   3:        3 2000-01-01 15.691228
##   4:        4 2000-01-01 11.457154
##   5:        5 2000-01-01  8.897931
##  ---
## 596:        6 2004-12-01 17.587000
## 597:        7 2004-12-01 19.555963
## 598:        8 2004-12-01 15.710465
## 599:        9 2004-12-01 15.322790
## 600:       10 2004-12-01 20.240392

Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.

In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.

Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j argument evaluations merged with their respective aggregation keys. Since our j argument results in a scalar value for each group, our result will be one row per Location/Date aggregation key.

answered Oct 21 '22 23:10

bgoldst

If we are using data.table, the CJ can be used

 CJ(Location=loc, date= dates,Model= mods)[, 
         Temperature:= temp][, .(Mean = mean(Temperature)), by = .(Location, date)]

answered Oct 21 '22 23:10

akrun

Related questions
                            
                                general lag in time series panel data
                            
                                Extract formula from model in R
                            
                                Plotting discrete and continuous scales in same ggplot
                            
                                Subsetting R array: dimension lost when its length is 1
                            
                                Print a data frame with columns aligned (as displayed in R)
                            
                                Plotting envfit vectors (vegan package) in ggplot2
                            
                                How to make grouped layout in igraph?
                            
                                How to make variable bar widths in ggplot2 not overlap or gap
                            
                                How to GeoCode a simple address using Data Science Toolbox
                            
                                How to return R's NULL in Rcpp code?
                            
                                How can I tell which packages I am not using in my R script?
                            
                                How to add multiple figures across multiple pages in a chunk using knitr and RMarkdown?
                            
                                dplyr: Integer sampling within mutate
                            
                                Pass Parameters from Command line into R markdown document
                            
                                R system functions always returns error 127
                            
                                How to define "hidden global variables" inside R packages?
                            
                                Prevent Rstudio console from showing script commands
                            
                                Align multiple plots with varying spacings and add arrows between them
                            
                                Add curly braces to ggplot2 and then use ggsave
                            
                                ggplot jitter geom_errorbar?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Summarize data.table by group

Tags:

r

aggregate

data.table

mean

thiagoveloso

People also ask

2 Answers

bgoldst

akrun

Recent Activity

Donate For Us