Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group by in base R

Tags:

r

group-by

I would like to express the following SQL query using base R (without any particular package):

select month, day, count(*) as count, avg(dep_delay) as avg_delay
from flights
group by month, day
having count > 1000

It selects the mean departure delay and the number of flights per day on busy days (days with more than 1000 flights). The dataset is nycflights13 containing information of flights departed from NYC in 2013.

Notice I can easily write this in dplyr as:

flights %>%
  group_by(month, day) %>%
  summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  filter(count > 1000)
like image 862
Massimo Franceschet Avatar asked Apr 05 '18 10:04

Massimo Franceschet


2 Answers

Since I was reminded earlier about the elegance of by (tip of the hat to @Parfait), here is a solution using by:

res <- by(flights, list(flights$month, flights$day), function(x)
    if (nrow(x) > 1000) {
        c(
            month = unique(x$month),
            day = unique(x$day),
            count = nrow(x),
            avg_delay = mean(x$dep_delay, na.rm = TRUE))
        })

# Store in data.frame and order by month, day
df <- do.call(rbind, res);
df <- df[order(df[, 1], df[, 2]) ,];
#     month day count avg_delay
#[1,]     7   8  1004 37.296646
#[2,]     7   9  1001 30.711499
#[3,]     7  10  1004 52.860702
#[4,]     7  11  1006 23.609392
#[5,]     7  12  1002 25.096154
#[6,]     7  17  1001 13.670707
#[7,]     7  18  1003 20.626789
#[8,]     7  25  1003 19.674134
#[9,]     7  31  1001  6.280843
#[10,]     8   7  1001  8.680402
#[11,]     8   8  1001 43.349947
#[12,]     8  12  1001  8.308157
#[13,]    11  27  1014 16.697651
#[14,]    12   2  1004  9.021978
like image 108
Maurits Evers Avatar answered Oct 14 '22 03:10

Maurits Evers


as commented you can use a combi of subset and aggregate. Changed the order of day & month to recieve the same order as your dplyr approach. Using na.action = NULL to count rows inclunding NAs.

library(nycflights13) 
#> Warning: Paket 'nycflights13' wurde unter R Version 3.4.4 erstellt
subset(aggregate(dep_delay ~ day + month, flights, 
       function(x) cbind(count=length(x), avg_delay=mean(x, na.rm = TRUE)),
       na.action = NULL), 
       dep_delay[,1] > 1000)
#>     day month dep_delay.1 dep_delay.2
#> 189   8     7 1004.000000   37.296646
#> 190   9     7 1001.000000   30.711499
#> 191  10     7 1004.000000   52.860702
#> 192  11     7 1006.000000   23.609392
#> 193  12     7 1002.000000   25.096154
#> 198  17     7 1001.000000   13.670707
#> 199  18     7 1003.000000   20.626789
#> 206  25     7 1003.000000   19.674134
#> 212  31     7 1001.000000    6.280843
#> 219   7     8 1001.000000    8.680402
#> 220   8     8 1001.000000   43.349947
#> 224  12     8 1001.000000    8.308157
#> 331  27    11 1014.000000   16.697651
#> 336   2    12 1004.000000    9.021978

Created on 2018-04-05 by the reprex package (v0.2.0).

like image 21
Roman Avatar answered Oct 14 '22 03:10

Roman