When max(x, na.rm = TRUE) is called with no non-NA values, it returns -Inf with a warning. However, in certain cases, the summarise function in dplyr does not emit the warning:
library(magrittr)
library(dplyr)
df1 <- data.frame(a = c("a","b"), b = c(NA,NA))
df1 %>% group_by(a) %>% summarise(x = max(b, na.rm = TRUE))
# Three warnings, as expected.
df2 <- data.frame(a = c("a","b"), b = c(1,NA))
df2 %>% group_by(a) %>% summarise(x = max(b, na.rm = TRUE))
# No warning. Unexpected.
Interestingly, if I rename the function, I get the warnings as expected:
# Pointer to same function.
stat <- max
df1 <- data.frame(a = c("a","b"), b = c(NA,NA))
df1 %>% group_by(a) %>% summarise(x = stat(b, na.rm = TRUE))
# Three warnings, as expected.
df2 <- data.frame(a = c("a","b"), b = c(1,NA))
df2 %>% group_by(a) %>% summarise(x = stat(b, na.rm = TRUE))
# Single warning, as expected.
Actually, I think it should be two warnings instead of three, because there are only two groups to summarise. But I am not sure how the internal warning system works, so perhaps three warnings is expected.
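As a point of comparison for the expected number of warnings, here is a minimal base R sketch that splits the column by group and counts the warnings raised by max(). The helper name count_max_warnings is hypothetical, and the sketch says nothing about how summarise evaluates its arguments internally; it only shows what plain per-group evaluation would produce.

```r
# Count the warnings raised by max(..., na.rm = TRUE) for each group,
# using base R only so that no dplyr-specific evaluation is involved.
# count_max_warnings is a hypothetical helper, not part of any package.
count_max_warnings <- function(values, groups) {
  n_warnings <- 0L
  result <- withCallingHandlers(
    vapply(split(values, groups),
           function(v) max(v, na.rm = TRUE),
           numeric(1)),
    warning = function(w) {
      # Tally the warning, then suppress it so evaluation continues quietly.
      n_warnings <<- n_warnings + 1L
      invokeRestart("muffleWarning")
    }
  )
  list(result = result, n_warnings = n_warnings)
}

df1 <- data.frame(a = c("a", "b"), b = c(NA, NA))
df2 <- data.frame(a = c("a", "b"), b = c(1, NA))

res1 <- count_max_warnings(df1$b, df1$a)  # both groups are all-NA
res2 <- count_max_warnings(df2$b, df2$a)  # only group "b" is all-NA

res1$n_warnings
res2$n_warnings
```

With one call to max() per group, this yields one warning per all-NA group, which matches the intuition of two warnings for df1 and one for df2.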
My question is: Why does summarise not output the warning in specific cases, and if that is expected, why would a simple rename of the function change this behaviour?
My sessionInfo():
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0.9000 magrittr_1.5
loaded via a namespace (and not attached):
[1] lazyeval_0.2.0.9000 R6_2.2.0 assertthat_0.1
[4] tools_3.3.2 DBI_0.5-1 tibble_1.2
[7] Rcpp_0.12.8
Although I am using the "dev" version of dplyr, I have also tested it on the version available on CRAN, with the same results.
Below is a partial diagnosis; it shows that dplyr is somehow interfering with the reference to the function name max(). Also, dplyr generally uses SE (Standard Evaluation) on its arguments via lazyeval::lazy_dots(..., .follow_symbols = FALSE), so maybe that affects the promise, although I can't see how:
A) group_by() is not the culprit. df2 %>% group_by(a) %>% summarise(length(na.omit(b))) shows that group b passes a vector with a single NA element to max().
B) When we reference max by its qualified name base::max, we do see the warning:
> df2 %>% group_by(a) %>% summarise(x = base::max(b, na.rm = TRUE))
a x
1 a 1
2 b -Inf
Warning message:
In base::max(NA_real_, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
And I checked that there is no dplyr:::max(), so it's not namespace shadowing.
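One way such a shadowing check could be done is sketched below. It uses only base R functions (find, is.primitive, exists, asNamespace); the variable names are illustrative, and the dplyr part only runs if dplyr happens to be installed.

```r
# Where does the name `max` resolve on the search path?
# In a default session this is expected to include "package:base".
where_max <- find("max")

# base::max is a primitive, so it has no enclosing R-level environment.
max_is_primitive <- is.primitive(max)

# Does the dplyr namespace define its own object called `max`?
# inherits = FALSE restricts the lookup to the namespace itself,
# ignoring its imports and the base namespace.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_defines_max <- exists("max", envir = asNamespace("dplyr"),
                              inherits = FALSE)
  print(dplyr_defines_max)
}

print(where_max)
print(max_is_primitive)
```

Note that a check like this only rules out an R-level object named max; it cannot detect special handling of the symbol inside compiled code.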
B2) Similarly, do.call(max, ...) gives the warning as expected:
> df2 %>% group_by(a) %>% summarise(x = do.call(max, list(b, na.rm = TRUE)))
a x
1 a 1
2 b -Inf
Warning message:
In .Primitive("max")(NA_real_, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
C) Also, note that dplyr generally uses SE (Standard Evaluation) on its arguments via lazyeval::lazy_dots(..., .follow_symbols = FALSE), but I can't see how that would cause this.
C2) I tried to recreate the internal result of the group_by with:
grouped_df(as.numeric(NA), list()), na.rm=T)
and to recreate the promise with something like:
p <- lazyeval::lazy_dots( max, list( grouped_df(as.numeric(NA), list()), na.rm=T ) , .follow_symbols=F)
I couldn't manage to formulate that with .follow_symbols=T.
I know almost nothing about Standard Evaluation, so further sleuthing can start at http://adv-r.had.co.nz/Expressions.html#metaprogramming
Versions used: dplyr 0.5.0; lazyeval 0.1.10 (although lazyeval 0.2.0 is Hadley's latest).
For max(), a hybrid version is available that works much faster for a grouped data frame, because the entire evaluation can be carried out in C++ without an R callback for each group. In dplyr 0.5.0, the hybrid version is triggered when all of the following conditions are met:
logical
constant
See the hybrid vignette for more detail.
The hybrid version of max() differs in certain aspects from the R implementation:
-Inf
NA vector will return NA even with na.rm = TRUE
In your example, c(NA, NA) is a vector of logical, so dplyr falls back to "regular" evaluation with one R callback for each group. If you need the original behavior, simply use a wrapper or an alias; the hybrid evaluator will fall back to regular evaluation:
max_ <- max
data_frame(a = NA_real_) %>% summarise(a = max_(a, na.rm = TRUE))
## # A tibble: 1 × 1
## a
## <dbl>
## 1 -Inf
## Warning message:
## In max_(a, na.rm = TRUE) : no non-missing arguments to max; returning -Inf
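The column types behind this explanation can be checked directly, and the wrapper alternative can be sketched in base R. The name max_wrapper is hypothetical, and the snippet only confirms that the wrapper keeps base R's warning and -Inf result; whether a particular dplyr version falls back to regular evaluation for it is taken from the explanation above and not verified here.

```r
# Column types in the two example data frames from the question.
df1 <- data.frame(a = c("a", "b"), b = c(NA, NA))
df2 <- data.frame(a = c("a", "b"), b = c(1, NA))
type1 <- class(df1$b)  # c(NA, NA) is stored as logical
type2 <- class(df2$b)  # c(1, NA) is stored as numeric

# Hypothetical wrapper: forwards all arguments to base::max().
max_wrapper <- function(x, ...) base::max(x, ...)

# Capture the warning message (if any) while keeping the return value.
caught <- NULL
value <- withCallingHandlers(
  max_wrapper(NA_real_, na.rm = TRUE),
  warning = function(w) {
    caught <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

# Intended use inside summarise (requires dplyr, not run here):
# df2 %>% group_by(a) %>% summarise(x = max_wrapper(b, na.rm = TRUE))
```

An alias is shorter to write, while a wrapper leaves room for extra handling, such as converting -Inf to NA, if that is wanted later.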