I would like to split a dataframe by multiple columns so that I can see the <code>summary()</code> output for each subset of the data. Here's a way to do that using <code>split()</code> from <code>base</code>: <pre class="prettyprint lang-r prettyprint-override"><code>library(tidyverse) #> Loading tidyverse: ggplot2 #> Loading tidyverse: tibble #> Loading tidyverse: tidyr #> Loading tidyverse: readr #> Loading tidyverse: purrr #> Loading tidyverse: dplyr #> Conflicts with tidy packages ---------------------------------------------- #> filter(): dplyr, stats #> lag(): dplyr, stats mtcars %>% select(1:3) %>% mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE), GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% split(list(.$GRP_A, .$GRP_B)) %>% map(summary) #> $A.1 #> mpg cyl disp GRP_A #> Min. :10.40 Min. :4.0 Min. :108.0 Length:10 #> 1st Qu.:14.97 1st Qu.:4.5 1st Qu.:151.9 Class :character #> Median :18.50 Median :7.0 Median :259.3 Mode :character #> Mean :17.61 Mean :6.4 Mean :283.4 #> 3rd Qu.:20.85 3rd Qu.:8.0 3rd Qu.:430.0 #> Max. :24.40 Max. :8.0 Max. :472.0 #> GRP_B #> Min. :1 #> 1st Qu.:1 #> Median :1 #> Mean :1 #> 3rd Qu.:1 #> Max. :1 #> #> $B.1 #> mpg cyl disp GRP_A #> Min. :15.00 Min. :4.0 Min. : 75.7 Length:5 #> 1st Qu.:21.00 1st Qu.:4.0 1st Qu.: 78.7 Class :character #> Median :21.50 Median :4.0 Median :120.1 Mode :character #> Mean :24.06 Mean :5.2 Mean :147.1 #> 3rd Qu.:30.40 3rd Qu.:6.0 3rd Qu.:160.0 #> Max. :32.40 Max. :8.0 Max. :301.0 #> GRP_B #> Min. :1 #> 1st Qu.:1 #> Median :1 #> Mean :1 #> 3rd Qu.:1 #> Max. :1 #> #> $A.2 #> mpg cyl disp GRP_A #> Min. :15.20 Min. :4.000 Min. : 95.1 Length:9 #> 1st Qu.:16.40 1st Qu.:6.000 1st Qu.:160.0 Class :character #> Median :18.10 Median :8.000 Median :275.8 Mode :character #> Mean :19.84 Mean :6.667 Mean :234.0 #> 3rd Qu.:21.00 3rd Qu.:8.000 3rd Qu.:275.8 #> Max. :30.40 Max. :8.000 Max. :360.0 #> GRP_B #> Min. :2 #> 1st Qu.:2 #> Median :2 #> Mean :2 #> 3rd Qu.:2 #> Max. :2 #> #> $B.2 #> mpg cyl disp GRP_A #> Min. :13.30 Min. :4 Min. : 71.1 Length:8 #> 1st Qu.:14.97 1st Qu.:4 1st Qu.:125.3 Class :character #> Median :20.55 Median :6 Median :201.5 Mode :character #> Mean :20.99 Mean :6 Mean :213.5 #> 3rd Qu.:23.93 3rd Qu.:8 3rd Qu.:315.5 #> Max. :33.90 Max. :8 Max. :360.0 #> GRP_B #> Min. :2 #> 1st Qu.:2 #> Median :2 #> Mean :2 #> 3rd Qu.:2 #> Max. :2 </code></pre> How can I achieve this same result using a <code>tidyverse</code> verb? My initial thought was to use <code>purrr::by_slice()</code>, but apparently that has been deprecated.

dplyr 0.8.0 has introduced the verb that you were looking for: <code>group_split()</code> From the documentation: <blockquote> <code>group_split</code>() works like <code>base::split()</code> but <ul> <li>it uses the grouping structure from group_by() and therefore is subject to the data mask</li> <li>it does not name the elements of the list based on the grouping as this typically loses information and is confusing.</li> </ul> <code>group_keys()</code> explains the grouping structure, by returning a data frame that has one row per group and one column per grouping variable. </blockquote> For your example: <pre class="prettyprint"><code>mtcars %>% select(1:3) %>% mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE), GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% group_split(GRP_A, GRP_B) %>% map(summary) </code></pre>

EDIT: this answer is now outdated. See @MartijnVanAttekum's solution above. <hr> The "tidy" solution seems to be a combination of "mutate + list-cols + purrr" according to Hadley. <pre class="prettyprint lang-r prettyprint-override"><code>library(tidyverse) library(magrittr) # group, nest, create a new col leveraging purrr::map() mt_summary <- mtcars %>% select(1:3) %>% mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE), GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% group_by(GRP_A, GRP_B) %>% nest() %>% mutate(SUMMARY = map(data, .f = summary)) # check the structure mt_summary #> # A tibble: 4 Ã— 4 #> GRP_A GRP_B data SUMMARY #> <chr> <int> <list> <list> #> 1 A 1 <tibble [11 Ã— 3]> <S3: table> #> 2 B 2 <tibble [9 Ã— 3]> <S3: table> #> 3 A 2 <tibble [7 Ã— 3]> <S3: table> #> 4 B 1 <tibble [5 Ã— 3]> <S3: table> # extract the summaries extract2(mt_summary, "SUMMARY") %>% set_names(paste0(extract2(mt_summary, "GRP_A"), extract2(mt_summary, "GRP_B"))) #> $A1 #> mpg cyl disp #> Min. :10.40 Min. :4.000 Min. : 75.7 #> 1st Qu.:15.25 1st Qu.:4.000 1st Qu.:120.9 #> Median :19.20 Median :6.000 Median :167.6 #> Mean :20.43 Mean :6.182 Mean :229.0 #> 3rd Qu.:25.85 3rd Qu.:8.000 3rd Qu.:309.5 #> Max. :30.40 Max. :8.000 Max. :460.0 #> #> $B2 #> mpg cyl disp #> Min. :15.20 Min. :4.000 Min. : 78.7 #> 1st Qu.:17.80 1st Qu.:4.000 1st Qu.:120.3 #> Median :19.20 Median :6.000 Median :167.6 #> Mean :20.84 Mean :6.222 Mean :225.9 #> 3rd Qu.:21.50 3rd Qu.:8.000 3rd Qu.:351.0 #> Max. :32.40 Max. :8.000 Max. :400.0 #> #> $A2 #> mpg cyl disp #> Min. :15.20 Min. :4.000 Min. : 71.1 #> 1st Qu.:18.90 1st Qu.:4.000 1st Qu.:114.5 #> Median :21.40 Median :6.000 Median :145.0 #> Mean :21.79 Mean :5.429 Mean :176.0 #> 3rd Qu.:22.10 3rd Qu.:6.000 3rd Qu.:241.5 #> Max. :33.90 Max. :8.000 Max. :304.0 #> #> $B1 #> mpg cyl disp #> Min. :10.40 Min. :4.0 Min. :140.8 #> 1st Qu.:13.30 1st Qu.:8.0 1st Qu.:275.8 #> Median :14.30 Median :8.0 Median :350.0 #> Mean :15.62 Mean :7.2 Mean :319.7 #> 3rd Qu.:17.30 3rd Qu.:8.0 3rd Qu.:360.0 #> Max. :22.80 Max. :8.0 Max. :472.0 </code></pre>

What is the tidyverse method for splitting a df by multiple columns?

Tags:

r

purrr

tidyverse

I would like to split a dataframe by multiple columns so that I can see the summary() output for each subset of the data.

Here's a way to do that using split() from base:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

mtcars %>% 
  select(1:3) %>% 
  mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
         GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% 
  split(list(.$GRP_A, .$GRP_B)) %>% 
  map(summary)
#> $A.1
#>       mpg             cyl           disp          GRP_A          
#>  Min.   :10.40   Min.   :4.0   Min.   :108.0   Length:10         
#>  1st Qu.:14.97   1st Qu.:4.5   1st Qu.:151.9   Class :character  
#>  Median :18.50   Median :7.0   Median :259.3   Mode  :character  
#>  Mean   :17.61   Mean   :6.4   Mean   :283.4                     
#>  3rd Qu.:20.85   3rd Qu.:8.0   3rd Qu.:430.0                     
#>  Max.   :24.40   Max.   :8.0   Max.   :472.0                     
#>      GRP_B  
#>  Min.   :1  
#>  1st Qu.:1  
#>  Median :1  
#>  Mean   :1  
#>  3rd Qu.:1  
#>  Max.   :1  
#> 
#> $B.1
#>       mpg             cyl           disp          GRP_A          
#>  Min.   :15.00   Min.   :4.0   Min.   : 75.7   Length:5          
#>  1st Qu.:21.00   1st Qu.:4.0   1st Qu.: 78.7   Class :character  
#>  Median :21.50   Median :4.0   Median :120.1   Mode  :character  
#>  Mean   :24.06   Mean   :5.2   Mean   :147.1                     
#>  3rd Qu.:30.40   3rd Qu.:6.0   3rd Qu.:160.0                     
#>  Max.   :32.40   Max.   :8.0   Max.   :301.0                     
#>      GRP_B  
#>  Min.   :1  
#>  1st Qu.:1  
#>  Median :1  
#>  Mean   :1  
#>  3rd Qu.:1  
#>  Max.   :1  
#> 
#> $A.2
#>       mpg             cyl             disp          GRP_A          
#>  Min.   :15.20   Min.   :4.000   Min.   : 95.1   Length:9          
#>  1st Qu.:16.40   1st Qu.:6.000   1st Qu.:160.0   Class :character  
#>  Median :18.10   Median :8.000   Median :275.8   Mode  :character  
#>  Mean   :19.84   Mean   :6.667   Mean   :234.0                     
#>  3rd Qu.:21.00   3rd Qu.:8.000   3rd Qu.:275.8                     
#>  Max.   :30.40   Max.   :8.000   Max.   :360.0                     
#>      GRP_B  
#>  Min.   :2  
#>  1st Qu.:2  
#>  Median :2  
#>  Mean   :2  
#>  3rd Qu.:2  
#>  Max.   :2  
#> 
#> $B.2
#>       mpg             cyl         disp          GRP_A          
#>  Min.   :13.30   Min.   :4   Min.   : 71.1   Length:8          
#>  1st Qu.:14.97   1st Qu.:4   1st Qu.:125.3   Class :character  
#>  Median :20.55   Median :6   Median :201.5   Mode  :character  
#>  Mean   :20.99   Mean   :6   Mean   :213.5                     
#>  3rd Qu.:23.93   3rd Qu.:8   3rd Qu.:315.5                     
#>  Max.   :33.90   Max.   :8   Max.   :360.0                     
#>      GRP_B  
#>  Min.   :2  
#>  1st Qu.:2  
#>  Median :2  
#>  Mean   :2  
#>  3rd Qu.:2  
#>  Max.   :2

How can I achieve this same result using a tidyverse verb? My initial thought was to use purrr::by_slice(), but apparently that has been deprecated.

416

asked Mar 09 '17 20:03

Tiernan

2 Answers

dplyr 0.8.0 has introduced the verb that you were looking for: group_split()

From the documentation:

group_split() works like base::split() but

it uses the grouping structure from group_by() and therefore is subject to the data mask

it does not name the elements of the list based on the grouping as this typically loses information and is confusing.

group_keys() explains the grouping structure, by returning a data frame that has one row per group and one column per grouping variable.

For your example:

mtcars %>% 
  select(1:3) %>% 
  mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
         GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% 
  group_split(GRP_A, GRP_B) %>% 
  map(summary)

172

answered Oct 11 '22 11:10

MartijnVanAttekum

EDIT: this answer is now outdated. See @MartijnVanAttekum's solution above.

The "tidy" solution seems to be a combination of "mutate + list-cols + purrr" according to Hadley.

library(tidyverse) 
library(magrittr) 

# group, nest, create a new col leveraging purrr::map()
mt_summary <- 
    mtcars %>% 
    select(1:3) %>% 
    mutate(GRP_A = sample(LETTERS[1:2],  n(), replace = TRUE), 
           GRP_B = sample(c(1:2), n(), replace = TRUE)) %>% 
    group_by(GRP_A, GRP_B) %>% 
    nest() %>% 
    mutate(SUMMARY = map(data, .f = summary))

# check the structure
mt_summary
#> # A tibble: 4 Ã— 4
#>   GRP_A GRP_B              data     SUMMARY
#>   <chr> <int>            <list>      <list>
#> 1     A     1 <tibble [11 Ã— 3]> <S3: table>
#> 2     B     2  <tibble [9 Ã— 3]> <S3: table>
#> 3     A     2  <tibble [7 Ã— 3]> <S3: table>
#> 4     B     1  <tibble [5 Ã— 3]> <S3: table>

# extract the summaries
extract2(mt_summary, "SUMMARY") %>% 
    set_names(paste0(extract2(mt_summary, "GRP_A"), 
                     extract2(mt_summary, "GRP_B")))
#> $A1
#>       mpg             cyl             disp      
#>  Min.   :10.40   Min.   :4.000   Min.   : 75.7  
#>  1st Qu.:15.25   1st Qu.:4.000   1st Qu.:120.9  
#>  Median :19.20   Median :6.000   Median :167.6  
#>  Mean   :20.43   Mean   :6.182   Mean   :229.0  
#>  3rd Qu.:25.85   3rd Qu.:8.000   3rd Qu.:309.5  
#>  Max.   :30.40   Max.   :8.000   Max.   :460.0  
#> 
#> $B2
#>       mpg             cyl             disp      
#>  Min.   :15.20   Min.   :4.000   Min.   : 78.7  
#>  1st Qu.:17.80   1st Qu.:4.000   1st Qu.:120.3  
#>  Median :19.20   Median :6.000   Median :167.6  
#>  Mean   :20.84   Mean   :6.222   Mean   :225.9  
#>  3rd Qu.:21.50   3rd Qu.:8.000   3rd Qu.:351.0  
#>  Max.   :32.40   Max.   :8.000   Max.   :400.0  
#> 
#> $A2
#>       mpg             cyl             disp      
#>  Min.   :15.20   Min.   :4.000   Min.   : 71.1  
#>  1st Qu.:18.90   1st Qu.:4.000   1st Qu.:114.5  
#>  Median :21.40   Median :6.000   Median :145.0  
#>  Mean   :21.79   Mean   :5.429   Mean   :176.0  
#>  3rd Qu.:22.10   3rd Qu.:6.000   3rd Qu.:241.5  
#>  Max.   :33.90   Max.   :8.000   Max.   :304.0  
#> 
#> $B1
#>       mpg             cyl           disp      
#>  Min.   :10.40   Min.   :4.0   Min.   :140.8  
#>  1st Qu.:13.30   1st Qu.:8.0   1st Qu.:275.8  
#>  Median :14.30   Median :8.0   Median :350.0  
#>  Mean   :15.62   Mean   :7.2   Mean   :319.7  
#>  3rd Qu.:17.30   3rd Qu.:8.0   3rd Qu.:360.0  
#>  Max.   :22.80   Max.   :8.0   Max.   :472.0

answered Oct 11 '22 10:10

Tiernan

Related questions
                            
                                How do I build a Multiple Criteria Index in R, incorporating > and < operators?
                            
                                Efficient way to find manager's manager's id
                            
                                Drawing ggplot Footer Using linesGrob within grobTree
                            
                                Image popup on hover in DT in R
                            
                                R: ggplot background gradient coloring
                            
                                Creating a data frame with the contents of multiple txt files
                            
                                R - allocate a share of a number over different columns using an ifelse statement
                            
                                Why does this image made by R have a spurious vertical white line in it?
                            
                                Why does my NLOPT optimization error/fail to solve?
                            
                                ggplot2: Deleting facets of unused factor level combinations from a plot (facet_grid)
                            
                                Bookdown: Single html output file
                            
                                How to cast data from long to wide format in H2O?
                            
                                Match by id and divide column values across two dataframes
                            
                                R Change IP Address programmatically
                            
                                Join gap in polar line ggplot plot
                            
                                Merging 2 vectors and removing all repetitions
                            
                                double nesting with tidyverse and purrr
                            
                                How can I replace vector values in a sequence at regular intervals in R?
                            
                                Correlationmatrix from data table
                            
                                How to capture system2() output in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What is the tidyverse method for splitting a df by multiple columns?

Tags:

r

purrr

tidyverse

Tiernan

People also ask

2 Answers

MartijnVanAttekum

Tiernan

Recent Activity

Donate For Us

What is the **tidyverse** method for splitting a df by multiple columns?

Tags:

r

purrr

tidyverse

Tiernan

People also ask

2 Answers

MartijnVanAttekum

Tiernan

Related questions

Recent Activity

Donate For Us

What is the tidyverse method for splitting a df by multiple columns?