I ran into trouble today when using cur_data() within summarize().
Example data:
library(tidyverse)
dat <- tibble(id = 1:6,
              type = c(1, 1, 2, 2, 3, 3),
              value = c(2, 4, 6, 8, 7, NA))
This first pipeline throws an error, mentioning df_slice():
dat %>%
  group_by(type) %>%
  summarize(mean = mean(value),
            n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            .groups = "drop")
#> Error in `summarize()`:
#> ! Problem while computing `n = length(...)`.
#> ℹ The error occurred in group 1: type = 1.
#> Caused by error:
#> ! Internal error in `df_slice()`: Columns must match the data frame size.
However, switching the order of the summary stats within summarize() avoids the error:
dat %>%
  group_by(type) %>%
  summarize(n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            mean = mean(value),
            .groups = "drop")
#> # A tibble: 3 × 3
#>    type     n  mean
#>   <dbl> <int> <dbl>
#> 1     1     2     3
#> 2     2     2     7
#> 3     3     1    NA
Additionally, piping cur_data() into as.data.frame() also avoids the error:
dat %>%
  group_by(type) %>%
  summarize(mean = mean(value),
            n = length(cur_data() %>% as.data.frame() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            .groups = "drop")
#> # A tibble: 3 × 3
#>    type  mean     n
#>   <dbl> <dbl> <int>
#> 1     1     3     2
#> 2     2     7     2
#> 3     3    NA     1
Created on 2022-02-15 by the reprex package (v2.0.1)
Why can I not use the first example syntax? Ultimately I calculated anything that required cur_data() within mutate() and just kept the first() observation within a later summarize() call, but I'd like to know what I'm missing about summarize().
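For reference, the mutate()-then-first() workaround mentioned above could look roughly like the sketch below. The exact code was not part of the original post, so the column name n and the intermediate object res are assumptions made for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

dat <- tibble(id = 1:6,
              type = c(1, 1, 2, 2, 3, 3),
              value = c(2, 4, 6, 8, 7, NA))

res <- dat %>%
  group_by(type) %>%
  # Compute the per-group count in mutate(), where cur_data() holds only
  # full-length columns, so every row of a group gets the same value
  mutate(n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique())) %>%
  # Then collapse each group, keeping the first copy of the repeated count
  summarize(mean = mean(value),
            n = first(n),
            .groups = "drop")

res
```

This should give the same counts as the reordered summarize() call, at the cost of an extra step.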
Additional session info:
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20.6.0 (64-bit)
Running under: macOS Monterey 12.1
Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.1.2/lib/R/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reprex_2.0.1 palmerpenguins_0.1.0 forcats_0.5.1 stringr_1.4.0 readr_2.1.2
[6] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1 tidyr_1.2.0 purrr_0.3.4
[11] dplyr_1.0.8
This is an open issue with dplyr: https://github.com/tidyverse/dplyr/issues/6138
To paraphrase the discussion in the GitHub issue: The problem is caused by
cur_data() including the previously summarised column (in this case, mean),
without it having been recycled to match the number of rows in the data frame.
That makes cur_data() essentially a malformed data frame.
In your case, using as.data.frame() solves the problem because it does
the recycling to make mean match the rest of the columns in length, and
having the statements in a different order solves the problem because at
that point cur_data() doesn’t include any new columns yet.
library(dplyr, warn.conflicts = FALSE)
dat <- tibble(
  id = 1:6,
  type = c(1, 1, 2, 2, 3, 3),
  value = c(2, 4, 6, 8, 7, NA)
)
dat %>%
  group_by(type) %>%
  summarize(
    mean = mean(value),
    str(cur_data())
  )
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 1 2
#>  $ value: num [1:2] 2 4
#>  $ mean : num 3
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 3 4
#>  $ value: num [1:2] 6 8
#>  $ mean : num 7
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 5 6
#>  $ value: num [1:2] 7 NA
#>  $ mean : num NA
#> # A tibble: 3 x 2
#>    type  mean
#> 1     1     3
#> 2     2     7
#> 3     3    NA
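If the goal is only to count distinct ids among rows with a non-missing value, one way to sidestep the problem is to avoid cur_data() altogether and index the columns directly. The following is a minimal sketch of that idea, not something proposed in the GitHub issue; the object name res is just for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

dat <- tibble(
  id = 1:6,
  type = c(1, 1, 2, 2, 3, 3),
  value = c(2, 4, 6, 8, 7, NA)
)

res <- dat %>%
  group_by(type) %>%
  summarize(
    mean = mean(value),
    # Count distinct ids among rows whose value is not missing;
    # id and value are still the full-length group columns here
    n = n_distinct(id[!is.na(value)]),
    .groups = "drop"
  )

res
```

Because id and value are referenced by name, the length-one mean column created earlier in the same summarize() call never enters the computation, so the order of the statements should no longer matter.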