Why does cur_data() within summarize() return df_slice() error?

Question

I ran into trouble today when using cur_data() within summarize().

Example data:

library(tidyverse)

dat <- tibble(id = 1:6,
              type = c(1, 1, 2, 2, 3, 3),
              value = c(2, 4, 6, 8, 7, NA))

This first pipeline throws an error, mentioning df_slice():

dat %>%
  group_by(type) %>%
  summarize(mean = mean(value),
            n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            .groups = "drop")
#> Error in `summarize()`:
#> ! Problem while computing `n = length(...)`.
#> ℹ The error occurred in group 1: type = 1.
#> Caused by error:
#> ! Internal error in `df_slice()`: Columns must match the data frame size.

However, switching the order of the summary stats within summarize() avoids the error:

dat %>%
  group_by(type) %>%
  summarize(n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            mean = mean(value),
            .groups = "drop")
#> # A tibble: 3 × 3
#>    type     n  mean
#>   <dbl> <int> <dbl>
#> 1     1     2     3
#> 2     2     2     7
#> 3     3     1    NA

Additionally, piping cur_data() into as.data.frame() also avoids the error:

dat %>%
  group_by(type) %>%
  summarize(mean = mean(value),
            n = length(cur_data() %>% as.data.frame() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            .groups = "drop")
#> # A tibble: 3 × 3
#>    type  mean     n
#>   <dbl> <dbl> <int>
#> 1     1     3     2
#> 2     2     7     2
#> 3     3    NA     1
Created on 2022-02-15 by the reprex package (v2.0.1)

Why can I not use the first example syntax? Ultimately I calculated anything that required cur_data() within mutate() and just kept the first() observation within a later summarize() call, but I'd like to know what I'm missing about summarize().
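
For reference, the mutate()-based workaround described above might look roughly like the sketch below. The exact code was not included in the question, so this is an assumed reconstruction, written against dplyr 1.0.8 as in the session info; newer dplyr releases may emit a deprecation warning for cur_data().

```r
library(dplyr, warn.conflicts = FALSE)

dat <- tibble(id = 1:6,
              type = c(1, 1, 2, 2, 3, 3),
              value = c(2, 4, 6, 8, 7, NA))

res <- dat %>%
  group_by(type) %>%
  # Compute the per-group count in mutate(), where every column still has
  # one value per row, so cur_data() is a well-formed data frame.
  mutate(n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique())) %>%
  # n is constant within each group, so keeping the first value is enough.
  summarize(mean = mean(value),
            n = first(n),
            .groups = "drop")

res
```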

Additional session info:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20.6.0 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.1.2/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reprex_2.0.1         palmerpenguins_0.1.0 forcats_0.5.1        stringr_1.4.0        readr_2.1.2         
 [6] tibble_3.1.6         ggplot2_3.3.5        tidyverse_1.3.1      tidyr_1.2.0          purrr_0.3.4         
[11] dplyr_1.0.8

Mikko Marttila · Accepted Answer

This is an open issue with dplyr: https://github.com/tidyverse/dplyr/issues/6138

To paraphrase the discussion in the GitHub issue: the problem is caused by cur_data() including the previously summarised column (in this case, mean) without that column having been recycled to match the number of rows in the data frame. That essentially makes cur_data() a malformed data frame.

In your case, using as.data.frame() solves the problem because it does the recycling to make mean match the rest of the columns in length, and having the statements in a different order solves the problem because at that point cur_data() doesn’t include any new columns yet.

library(dplyr, warn.conflicts = FALSE)

dat <- tibble(
  id = 1:6,
  type = c(1, 1, 2, 2, 3, 3),
  value = c(2, 4, 6, 8, 7, NA)
)

dat %>%
  group_by(type) %>%
  summarize(
    mean = mean(value),
    str(cur_data())
  )
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 1 2
#>  $ value: num [1:2] 2 4
#>  $ mean : num 3
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 3 4
#>  $ value: num [1:2] 6 8
#>  $ mean : num 7
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 5 6
#>  $ value: num [1:2] 7 NA
#>  $ mean : num NA
#> # A tibble: 3 x 2
#>    type  mean
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     7
#> 3     3    NA
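
Until the issue is resolved, one way to sidestep it is to avoid cur_data() altogether and work on the column vectors directly. The following is a minimal sketch rather than something proposed in the GitHub issue; it assumes the goal is simply the number of distinct id values with a non-missing value in each group, and uses dplyr's n_distinct() for the count.

```r
library(dplyr, warn.conflicts = FALSE)

dat <- tibble(
  id = 1:6,
  type = c(1, 1, 2, 2, 3, 3),
  value = c(2, 4, 6, 8, 7, NA)
)

# Count distinct ids with a non-missing value without calling cur_data().
# id and value are plain per-group vectors here, so the summaries can be
# listed in either order.
res <- dat %>%
  group_by(type) %>%
  summarize(
    mean = mean(value),
    n = n_distinct(id[!is.na(value)]),
    .groups = "drop"
  )

res
```

Because no data frame is assembled from the partially summarised columns, the unrecycled mean column never comes into play.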

Tags: r, dplyr, tidyverse

billybarc
