Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using dplyr::group_by() to find min dates with NAs [duplicate]

Tags:

date

r

na

dplyr

I'm finding the minimum date within a group. Many times, the group includes only missing dates (in which case I'd prefer something like NA to be assigned).

The NAs appear to be assigned correctly, but they're not responding to is.na() as I expect. When a cell appears as NA, the is.na() output is unexpectedly FALSE.

library(magrittr)
ds_visit <- tibble::tribble(
  ~subject_id,                   ~date,
           1L,  as.Date("2017-01-01" ),
           1L,  as.Date("2017-02-01" ), 

           2L,  as.Date(NA_character_),        
           2L,  as.Date("2017-01-02" ),

           3L,  as.Date(NA_character_),        
           3L,  as.Date(NA_character_),   

           4L,  as.Date(NA_character_),        
           4L,  as.Date(NA_character_)       
)

ds_subject <- ds_visit %>% 
  # as.data.frame() %>% 
  dplyr::group_by(subject_id) %>% 
  dplyr::mutate(
    date_na     = is.na(date),          # Works as expected
    date_min    = min(date, na.rm=T),   # Works as expected

    date_min_na = is.na(date_min)       # Does NOT work as expected.
  ) %>% 
  dplyr::ungroup() # %>% as.data.frame() 

ds_visit looks correct. ds_subject looks correct to me, except for that last column .

ds_subject (The last four rows of the last column are unexpected.)

# A tibble: 8 x 5
  subject_id date       date_na date_min   date_min_na
       <int> <date>     <lgl>   <date>     <lgl>      
1          1 2017-01-01 F       2017-01-01 F          
2          1 2017-02-01 F       2017-01-01 F          
3          2 NA         T       2017-01-02 F          
4          2 2017-01-02 F       2017-01-02 F          
5          3 NA         T       NA         F         # Should be 'T'?
6          3 NA         T       NA         F         # Should be 'T'?
7          4 NA         T       NA         F         # Should be 'T'?
8          4 NA         T       NA         F         # Should be 'T'?

I've jittered several dimensions without success, including: (a) OS, (b) R version (including 3.4.3 patched), (c) dplyr & rlang version (including CRAN and GitHub versions), and (d) tibble vs. data.frame. As a temp work around (not shown here), I've converted the date to a character right before finding the min, and then converting back to a date.

Warning messages (generated from subjects 3 & 4): Even though the warning message says that Inf is returned, NA appears when printing the dataset. (This behavior is consistent with min(as.Date(NA), na.rm=T)).

1: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

Further inspection of the date column seems consistent with the dataset view above. The type is a date and the last four cells are NA, not infinity.

> str(ds_subject$date_min)
 Date[1:8], format: "2017-01-01" "2017-01-01" "2017-01-02" "2017-01-02" NA NA NA NA

Is this a bug, or am I misusing something? Is this related NA being produced instead of infinity?

Edit 1

The links below by @eipi10 and @mtoto help me understand better. Thanks. I'm not thrilled that 'NA' prints instead of 'Inf', but I'll try to remember that.

To address this specific scenario, is there a better function than base::min()?

I'd like a function that I could include inside the dplyr::mutate()/dplyr::summarize() clause that behaves something like SQL. (The initial dplyr example still has that is.na() problem when summarize() replaces mutate()).

For example:

"
  SELECT 
    subject_id,
    MIN(date) AS date_min
    --MIN(date) OVER (PARTITION BY subject_id) AS date_min --`OVER` not supported by sqlite
  FROM ds_visit
  GROUP BY subject_id
" %>% 
  sqldf::sqldf() %>% 
  tibble::as_tibble() %>% 
  dplyr::mutate(
    # date_min_na_1 = is.na(date_min), #Before conversion back to date (from numeric); same result as below.
    date_min      = as.Date(date_min, "1970-01-01"),
    date_min_na   = is.na(date_min)
  )

Result where missing groups have well-mannered NA values that respond as expected to is.na():

# A tibble: 4 x 3
  subject_id date_min   date_min_na
       <int> <date>     <lgl>      
1          1 2017-01-01 F          
2          2 2017-01-02 F          
3          3 NA         T          
4          4 NA         T          

Edit 2

I see this question was marked as a duplicate of R Inf when it has class Date is printing NA. I see a lot of overlap (and I learned a lot from that question and how it was problematic for my initial code), but I believe they're different questions.

This question involves grouping, and returning NA when no nonmissing values are present. I'm not interested in solely base::min(). As written above, ideally base::min() is avoided altogether in favor of an established & tested function/approach that behaves more like SQL.

(Although I'm grateful for @alistaire's wrapper around base:min() and will use it if an established function/approach doesn't exist.)

like image 713
wibeasley Avatar asked Jan 26 '18 22:01

wibeasley


1 Answers

The issue is that min with na.rm = TRUE and all-NA values returns Inf (max equivalently returns -Inf), but print.Date doesn't have a way to display those values, so it prints it as NA, even though that is not the stored value.

min(NA, na.rm = TRUE)
#> Warning in min(NA, na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> [1] Inf

x <- min(as.Date(NA), na.rm = TRUE)
#> Warning in min.default(structure(NA_real_, class = "Date"), na.rm = TRUE):
#> no non-missing arguments to min; returning Inf

x
#> [1] NA

is.na(x)
#> [1] FALSE

x == Inf
#> [1] TRUE

If you like, you can redefine the print method so it prints however you like, e.g.

print.Date <- function(x, ...){
    if(x == Inf | x == -Inf) {
        print(as.numeric(x))
    } else {
        base::print.Date(x, ...)
    }
}

x
#> [1] Inf

To actually get the result you want, specify what should be returned if all the values are NA:

library(tidyverse)

ds_visit <- data_frame(subject_id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 
                       date = as.Date(c("2017-01-01", "2017-02-01", NA, "2017-01-02", NA, NA, NA, NA)))

ds_visit %>% 
    group_by(subject_id) %>% 
    summarise(date_min = if(all(is.na(date))) NA else min(date, na.rm = TRUE), 
              date_min_na = is.na(date_min))
#> # A tibble: 4 x 3
#>   subject_id date_min   date_min_na
#>        <int> <date>     <lgl>      
#> 1          1 2017-01-01 FALSE      
#> 2          2 2017-01-02 FALSE      
#> 3          3 NA         TRUE       
#> 4          4 NA         TRUE

It's not as concise, but is entirely predictable in its behavior.

like image 100
alistaire Avatar answered Oct 19 '22 07:10

alistaire