summarise does not return warning from max when no non-NA values

Tags: r, warnings, dplyr

When max(x, na.rm = TRUE) is called on a vector with no non-NA values, it returns -Inf with a warning. However, in certain cases, dplyr's summarise() does not emit that warning:

library(magrittr)
library(dplyr)

df1 <- data.frame(a = c("a","b"), b = c(NA,NA))
df1 %>% group_by(a) %>% summarise(x = max(b, na.rm = TRUE))
# Three warnings, as expected.

df2 <- data.frame(a = c("a","b"), b = c(1,NA))
df2 %>% group_by(a) %>% summarise(x = max(b, na.rm = TRUE))
# No warning. Unexpected.

Interestingly, if I call the same function under a different name, I get the warnings as expected:

# Pointer to same function.
stat <- max

df1 <- data.frame(a = c("a","b"), b = c(NA,NA))
df1 %>% group_by(a) %>% summarise(x = stat(b, na.rm = TRUE))
# Three warnings, as expected.

df2 <- data.frame(a = c("a","b"), b = c(1,NA))
df2 %>% group_by(a) %>% summarise(x = stat(b, na.rm = TRUE))
# Single warning, as expected.

Actually, I think it should be two warnings instead of three, because there are only two groups to summarise. But I am not sure how the internal warning system works, so perhaps three warnings is as expected.
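
One way to check how many warnings are actually signalled, rather than how many the console prints, is to count them with a calling handler. The following is a minimal base-R sketch; count_warnings is a hypothetical helper name, not something dplyr provides:

```r
# Hypothetical helper: evaluate an expression and count the warnings it signals.
count_warnings <- function(expr) {
  n <- 0L
  value <- withCallingHandlers(
    expr,  # the promise is forced here, while the handler is active
    warning = function(w) {
      n <<- n + 1L                    # tally this warning
      invokeRestart("muffleWarning")  # then silence it and carry on
    }
  )
  list(value = value, n_warnings = n)
}

# Base R only: with na.rm = TRUE an all-NA vector leaves nothing to compare.
res <- count_warnings(max(c(NA, NA), na.rm = TRUE))
res$value       # -Inf
res$n_warnings  # 1
```

Wrapping the whole df1 %>% group_by(a) %>% summarise(...) pipeline in count_warnings() should show whether two or three warnings are raised there.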

My question is: Why does summarise not output the warning in specific cases, and if that is expected, why would a simple rename of the function change this behaviour?

My sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.5.0.9000 magrittr_1.5

loaded via a namespace (and not attached):
[1] lazyeval_0.2.0.9000 R6_2.2.0            assertthat_0.1
[4] tools_3.3.2         DBI_0.5-1           tibble_1.2
[7] Rcpp_0.12.8

Although I am using the "dev" version of dplyr, I have also tested this with the version available on CRAN, with the same results.

nograpes asked Nov 30 '16 18:11


2 Answers

Below is a partial diagnosis. It shows that dplyr is somehow mishandling the reference to the function name max():

A) group_by() is not the culprit. df2 %>% group_by(a) %>% summarise(length(na.omit(b))) shows that group "b" has no non-NA values, so max() really is being passed a vector with a single NA element for that group.

B) When we reference max by its qualified name base::max, we do see the warning:

> df2 %>% group_by(a) %>% summarise(x = base::max(b, na.rm = TRUE))
       a     x
1      a     1
2      b  -Inf
Warning message:
In base::max(NA_real_, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

And I checked that there is no dplyr:::max(), so it's not namespace shadowing.

B2) Similarly, do.call(max, ...) gives the warning as expected.

> df2 %>% group_by(a) %>% summarise(x = do.call(max, list(b, na.rm = TRUE)))
       a     x
1      a     1
2      b  -Inf
Warning message:
In .Primitive("max")(NA_real_, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

C) Also, note that dplyr captures the arguments of summarise() unevaluated (non-standard evaluation) via lazyeval::lazy_dots(..., .follow_symbols = FALSE), so maybe that affects the promise, but I can't see how that would cause this.

C2) I tried to recreate the internal result of the group_by with:

max(grouped_df(as.numeric(NA), list()), na.rm = TRUE)

and to recreate the promise with something like:

p <- lazyeval::lazy_dots(max, list(grouped_df(as.numeric(NA), list()), na.rm = TRUE), .follow_symbols = FALSE)

I couldn't manage to formulate that with .follow_symbols = TRUE.

I know almost nothing about non-standard evaluation, so sleuth on at http://adv-r.had.co.nz/Expressions.html#metaprogramming

Versions used: dplyr 0.5.0; lazyeval 0.1.10 (although lazyeval 0.2.0 is Hadley's latest).

smci answered Oct 17 '22 06:10


For max(), a hybrid version is available that works much faster for a grouped data frame, because the entire evaluation can be carried out in C++ without calling back into R for each group. In dplyr 0.5.0, the hybrid version is triggered when all of the following conditions are met:

  • The first argument refers to a variable that exists in the data frame
  • The second argument is a logical constant

See the hybrid vignette for more detail.
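
To see why base::max, an alias, or do.call() would all bypass the hybrid path, it helps to look at the unevaluated call. The sketch below is a hypothetical matcher written in plain R, not dplyr's actual C++ logic. It mirrors the two conditions listed above, plus one assumption taken from the remark further down about logical columns: that the column has to be numeric.

```r
# Hypothetical matcher, for illustration only; it is NOT dplyr's real check.
looks_like_hybrid_max <- function(expr, data) {
  is.call(expr) &&
    identical(expr[[1]], as.name("max")) &&        # called via the bare symbol `max`
    length(expr) == 3L &&                           # function plus exactly two arguments
    is.name(expr[[2]]) &&                           # first argument is a symbol ...
    as.character(expr[[2]]) %in% names(data) &&     # ... that names a column
    is.numeric(data[[as.character(expr[[2]])]]) &&  # assumption: numeric columns only
    is.logical(expr[[3]]) &&                        # second argument is a
    length(expr[[3]]) == 1L                         # logical constant
}

df1 <- data.frame(a = c("a", "b"), b = c(NA, NA))
df2 <- data.frame(a = c("a", "b"), b = c(1, NA))

looks_like_hybrid_max(quote(max(b, na.rm = TRUE)), df2)        # TRUE
looks_like_hybrid_max(quote(max(b, na.rm = TRUE)), df1)        # FALSE: b is logical
looks_like_hybrid_max(quote(stat(b, na.rm = TRUE)), df2)       # FALSE: different symbol
looks_like_hybrid_max(quote(base::max(b, na.rm = TRUE)), df2)  # FALSE: head is a `::` call
```

Under these assumptions, only the plain max(b, na.rm = TRUE) call on the numeric column matches, which is consistent with df2 being the one case where the warning goes missing.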

The hybrid version of max() differs in certain aspects from the R implementation:

  • No warnings are raised for an empty vector, silently returning -Inf
    • I think this was always the case; we might as well add a warning here, but I suspect that other users won't be happy about this
  • An all-NA vector will return NA even with na.rm = TRUE
    • This is certainly a bug, I filed an issue

In your example, c(NA, NA) is a logical vector, so for df1 dplyr falls back to "regular" evaluation with one R callback for each group, and the warnings appear. In df2, b is numeric, so the hybrid version is used and no warning is raised. If you need the original behavior, simply use a wrapper or an alias; the hybrid evaluator will then fall back to regular evaluation:

max_ <- max
data_frame(a = NA_real_) %>% summarise(a = max_(a, na.rm = TRUE))
## # A tibble: 1 × 1
##       a
##   <dbl>
## 1  -Inf
## Warning message:
## In max_(a, na.rm = TRUE) : no non-missing arguments to max; returning -Inf
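
A wrapper should behave like the alias, presumably because the call no longer starts with the bare symbol max. A minimal sketch follows; max_warn is a hypothetical name, and the per-group check uses base R's tapply() rather than dplyr so that it runs without any packages:

```r
# Hypothetical wrapper: forwards everything to base max().
max_warn <- function(x, ...) max(x, ...)

df2 <- data.frame(a = c("a", "b"), b = c(1, NA))

# Apply the wrapper once per group; group "b" has no non-NA values,
# so base max() warns and returns -Inf for it.
res <- tapply(df2$b, df2$a, max_warn, na.rm = TRUE)
res
```

Inside summarise(), the same wrapper would be used as summarise(x = max_warn(b, na.rm = TRUE)).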
krlmlr answered Oct 17 '22 08:10