I am using a dataset of flights. I try to calculate the average arrival and departure delay for different airports (origin). I tried it with the aggregate function:
average_delay <- aggregate(cbind(arr_delay,dep_delay) ~ origin, FUN = mean, data = flights)
print(average_delay, digits = 3)
After that I tried it also with tidyverse:
library(tidyverse)
average_delay_tidy = flights %>%
group_by(origin) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE),
dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
mutate_if(is.numeric, round, digits = 3) %>%
print
However I got different outputs:
Aggregate:
|origin|arr_delay |dep_delay |
:------|:--------:|:--------:|
|EWR | 9.56 | **15.0** |
|JFK | 5.85 | **12.0** |
|LGA | 6.11 | 10.3 |
Tidyverse:
<chr> <dbl> <dbl>
|origin|arr_delay |dep_delay |
:------|:--------:|:--------:|
|EWR | 9.56 | **15.1** |
|JFK | 5.85 | **12.1** |
|LGA | 6.11 | 10.3 |
De difference is small, however, I do not understand how it is possible? Can someone explain to me why these outputs are different? Is it due to the NA values?
Thanks in advance!
The difference is so small that it seemed to be related to rounding as tibble does some print formatting for display, but in aggregate, if there are NAs, specify the na.action argument which by default is na.omit i.e. it will remove the row completely if there is a single NA element in the row. Change it to NULL
aggregate(cbind(arr_delay,dep_delay) ~ origin, FUN = mean,
na.rm = TRUE, data = flights,
na.action = NULL)
-output
origin arr_delay dep_delay
1 EWR 9.107055 15.10795
2 JFK 5.551481 12.11216
3 LGA 5.783488 10.34688
To confirm if there are NA elements
colSums(is.na(flights[c('arr_delay', 'dep_delay')]))
arr_delay dep_delay
9430 8255
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With