
Accessing grouping variables in purrr::map() with nested dataframes

Tags: r, dplyr, purrr, tidyr

I'm using tidyr::nest() together with the purrr::map() family to split a data.frame into groups and then do some fancy stuff with each subset. Consider the following example, and please ignore the fact that I don't need nest() and map() to do this (it is an oversimplified example):

library(dplyr)
library(purrr)
library(tidyr)

mtcars %>% 
  group_by(cyl, gear) %>%
  nest() %>%
  mutate(
    wt_mean = map_dbl(data,~mean(.x$wt))
  )

# A tibble: 8 x 4
    cyl  gear data               cly2
  <dbl> <dbl> <list>            <dbl>
1     6     4 <tibble [4 x 9]>      6
2     4     4 <tibble [8 x 9]>      4
3     6     3 <tibble [2 x 9]>      6
4     8     3 <tibble [12 x 9]>     8
5     4     3 <tibble [1 x 9]>      4
6     4     5 <tibble [2 x 9]>      4
7     8     5 <tibble [2 x 9]>      8
8     6     5 <tibble [1 x 9]>      6

Usually when I do this type of operation, I need access to the grouping variable (cyl in this case) within map(). But inside map() the grouping variables appear as vectors whose length equals the number of rows of the nested data frame, so they can't be used directly as per-group values.

Is there a way I could run the following operation? I want the mean of wt divided by the number of cylinders (cyl) for each group (i.e. each row).

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  mutate(
    wt_mean = map_dbl(data,~mean(.x$wt)/cyl)
  )


Error in mutate_impl(.data, dots) : 
  Evaluation error: Result 1 is not a length 1 atomic vector.
Ratnanil asked Jan 02 '23 12:01



2 Answers

Take cyl out of the map call:

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  mutate(
    wt_mean = map_dbl(data, ~mean(.x$wt)) / cyl
  )

# A tibble: 8 x 4
    cyl  gear data              wt_mean
  <dbl> <dbl> <list>              <dbl>
1     6     4 <tibble [4 x 9]>    0.516
2     4     4 <tibble [8 x 9]>    0.595
3     6     3 <tibble [2 x 9]>    0.556
4     8     3 <tibble [12 x 9]>   0.513
5     4     3 <tibble [1 x 9]>    0.616
6     4     5 <tibble [2 x 9]>    0.457
7     8     5 <tibble [2 x 9]>    0.421
8     6     5 <tibble [1 x 9]>    0.462

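map_dbl() sees cyl as a length-8 vector because nest() removes the grouping from the data frame (at least in the tidyr version used here). Referencing cyl inside the mapped function, as in the OP's example, therefore produces a length-8 vector for each of the 8 rows instead of a single number, which is why map_dbl() fails.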
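To see those length-8 results directly, here is a minimal sketch that swaps map_dbl() for map(), so the call returns a list instead of failing. The explicit ungroup() is an assumption on my part: newer tidyr releases may keep the grouping after nest(), so it is there only to reproduce the ungrouped situation described above. The names nested and per_row are made up for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Nest by both grouping variables, then drop the grouping explicitly
# (assumption: this mimics the tidyr behaviour described in this answer).
nested <- mtcars %>%
  group_by(cyl, gear) %>%
  nest() %>%
  ungroup()

# map() instead of map_dbl(): keep whatever each call returns.
per_row <- nested %>%
  mutate(res = map(data, ~ mean(.x$wt) / cyl)) %>%
  pull(res)

# Each element is as long as the whole cyl column, not length 1.
lengths(per_row)
```

Every element has one entry per row of the nested tibble, which is exactly what map_dbl() refuses to accept.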

Two other approaches:

Both give the same result as above but keep the grouping variable inside the map_* call, as the OP asked:

Regrouping after nest()

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  group_by(cyl, gear) %>%
  mutate(wt_mean = map_dbl(data,~mean(.x$wt)/cyl))

map2_dbl() for iterating over data and cyl in parallel

mtcars %>% 
  group_by(cyl,gear) %>%
  nest() %>%
  mutate(wt_mean = map2_dbl(data, cyl,~mean(.x$wt)/ .y))
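If more than one grouping variable is needed inside the function, the same idea should extend to pmap_dbl(), which iterates over several columns in parallel. This is only a sketch: dividing by cyl * gear is a made-up calculation for illustration, and wt_scaled is a hypothetical column name.

```r
library(dplyr)
library(purrr)
library(tidyr)

res <- mtcars %>%
  group_by(cyl, gear) %>%
  nest() %>%
  mutate(
    # The list elements are matched to the function arguments by position,
    # so each call receives one nested tibble plus its own cyl and gear.
    wt_scaled = pmap_dbl(
      list(data, cyl, gear),
      function(data, cyl, gear) mean(data$wt) / (cyl * gear)
    )
  )
```

Because each call gets scalar cyl and gear values, this does not depend on whether the nested tibble is still grouped.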
zack answered Jan 05 '23 00:01



As of the dplyr 0.8.0 release, you can use group_map(), which I find very handy for this use case. This example is by GitHub user @yutannihilation:

library(dplyr, warn.conflicts = FALSE)

mtcars %>% 
  group_by(cyl) %>%
  group_map(function(data, group_info) {
    tibble::tibble(wt_mean = mean(data$wt) / group_info$cyl)
  })
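A note for later dplyr versions: as far as I know, from dplyr 0.8.1 onward group_map() returns a list, and the data-frame-returning behaviour shown above lives in group_modify(). Under that assumption, the equivalent call would look roughly like this (res is a name chosen here for illustration):

```r
library(dplyr, warn.conflicts = FALSE)

res <- mtcars %>%
  group_by(cyl) %>%
  # The function receives the group's rows (without cyl) and a one-row
  # tibble holding the grouping key; it must return a data frame.
  group_modify(function(data, group_info) {
    tibble::tibble(wt_mean = mean(data$wt) / group_info$cyl)
  })

res
```

The result has one row per cyl group, with the grouping column added back next to wt_mean.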
Ratnanil answered Jan 05 '23 01:01
