I'm finding the minimum date within a group. Many times, the group includes only missing dates (in which case I'd prefer something like NA
to be assigned).
The NA
s appear to be assigned correctly, but they're not responding to is.na()
as I expect. When a cell appears as NA
, the is.na()
output is unexpectedly FALSE.
library(magrittr)
ds_visit <- tibble::tribble(
~subject_id, ~date,
1L, as.Date("2017-01-01" ),
1L, as.Date("2017-02-01" ),
2L, as.Date(NA_character_),
2L, as.Date("2017-01-02" ),
3L, as.Date(NA_character_),
3L, as.Date(NA_character_),
4L, as.Date(NA_character_),
4L, as.Date(NA_character_)
)
ds_subject <- ds_visit %>%
# as.data.frame() %>%
dplyr::group_by(subject_id) %>%
dplyr::mutate(
date_na = is.na(date), # Works as expected
date_min = min(date, na.rm=T), # Works as expected
date_min_na = is.na(date_min) # Does NOT work as expected.
) %>%
dplyr::ungroup() # %>% as.data.frame()
ds_visit
looks correct. ds_subject
looks correct to me, except for that last column .
ds_subject (The last four rows of the last column are unexpected.)
# A tibble: 8 x 5
subject_id date date_na date_min date_min_na
<int> <date> <lgl> <date> <lgl>
1 1 2017-01-01 F 2017-01-01 F
2 1 2017-02-01 F 2017-01-01 F
3 2 NA T 2017-01-02 F
4 2 2017-01-02 F 2017-01-02 F
5 3 NA T NA F # Should be 'T'?
6 3 NA T NA F # Should be 'T'?
7 4 NA T NA F # Should be 'T'?
8 4 NA T NA F # Should be 'T'?
I've jittered several dimensions without success, including: (a) OS, (b) R version (including 3.4.3 patched),
(c) dplyr & rlang version (including CRAN and GitHub versions), and (d) tibble
vs. data.frame
. As a temp work around (not shown here), I've converted the date to a character right before finding the min, and then converting back to a date.
Warning messages (generated from subjects 3 & 4): Even though the warning message says that Inf
is returned, NA
appears when printing the dataset. (This behavior is consistent with min(as.Date(NA), na.rm=T)
).
1: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
no non-missing arguments to min; returning Inf
2: In min.default(c(NA_real_, NA_real_), na.rm = TRUE) :
no non-missing arguments to min; returning Inf
Further inspection of the date column seems consistent with the dataset view above. The type is a date and the last four cells are NA
, not infinity.
> str(ds_subject$date_min)
Date[1:8], format: "2017-01-01" "2017-01-01" "2017-01-02" "2017-01-02" NA NA NA NA
Is this a bug, or am I misusing something? Is this related NA
being produced instead of infinity?
The links below by @eipi10 and @mtoto help me understand better. Thanks. I'm not thrilled that 'NA' prints instead of 'Inf', but I'll try to remember that.
To address this specific scenario, is there a better function than base::min()
?
I'd like a function that I could include inside the dplyr::mutate()
/dplyr::summarize()
clause that behaves something like SQL. (The initial dplyr example still has that is.na()
problem when summarize()
replaces mutate()
).
For example:
"
SELECT
subject_id,
MIN(date) AS date_min
--MIN(date) OVER (PARTITION BY subject_id) AS date_min --`OVER` not supported by sqlite
FROM ds_visit
GROUP BY subject_id
" %>%
sqldf::sqldf() %>%
tibble::as_tibble() %>%
dplyr::mutate(
# date_min_na_1 = is.na(date_min), #Before conversion back to date (from numeric); same result as below.
date_min = as.Date(date_min, "1970-01-01"),
date_min_na = is.na(date_min)
)
Result where missing groups have well-mannered NA
values that respond as expected to is.na()
:
# A tibble: 4 x 3
subject_id date_min date_min_na
<int> <date> <lgl>
1 1 2017-01-01 F
2 2 2017-01-02 F
3 3 NA T
4 4 NA T
I see this question was marked as a duplicate of R Inf
when it has class Date
is printing NA
. I see a lot of overlap (and I learned a lot from that question and how it was problematic for my initial code), but I believe they're different questions.
This question involves grouping, and returning NA
when no nonmissing values are present. I'm not interested in solely base::min()
. As written above, ideally base::min()
is avoided altogether in favor of an established & tested function/approach that behaves more like SQL.
(Although I'm grateful for @alistaire's wrapper around base:min()
and will use it if an established function/approach doesn't exist.)
The issue is that min
with na.rm = TRUE
and all-NA
values returns Inf
(max
equivalently returns -Inf
), but print.Date
doesn't have a way to display those values, so it prints it as NA
, even though that is not the stored value.
min(NA, na.rm = TRUE)
#> Warning in min(NA, na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> [1] Inf
x <- min(as.Date(NA), na.rm = TRUE)
#> Warning in min.default(structure(NA_real_, class = "Date"), na.rm = TRUE):
#> no non-missing arguments to min; returning Inf
x
#> [1] NA
is.na(x)
#> [1] FALSE
x == Inf
#> [1] TRUE
If you like, you can redefine the print method so it prints however you like, e.g.
print.Date <- function(x, ...){
if(x == Inf | x == -Inf) {
print(as.numeric(x))
} else {
base::print.Date(x, ...)
}
}
x
#> [1] Inf
To actually get the result you want, specify what should be returned if all the values are NA
:
library(tidyverse)
ds_visit <- data_frame(subject_id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
date = as.Date(c("2017-01-01", "2017-02-01", NA, "2017-01-02", NA, NA, NA, NA)))
ds_visit %>%
group_by(subject_id) %>%
summarise(date_min = if(all(is.na(date))) NA else min(date, na.rm = TRUE),
date_min_na = is.na(date_min))
#> # A tibble: 4 x 3
#> subject_id date_min date_min_na
#> <int> <date> <lgl>
#> 1 1 2017-01-01 FALSE
#> 2 2 2017-01-02 FALSE
#> 3 3 NA TRUE
#> 4 4 NA TRUE
It's not as concise, but is entirely predictable in its behavior.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With