I would like to split a dataframe by multiple columns so that I can see the summary()
output for each subset of the data.
Here's a way to do that using split()
from base
:
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
split(list(.$GRP_A, .$GRP_B)) %>%
map(summary)
#> $A.1
#> mpg cyl disp GRP_A
#> Min. :10.40 Min. :4.0 Min. :108.0 Length:10
#> 1st Qu.:14.97 1st Qu.:4.5 1st Qu.:151.9 Class :character
#> Median :18.50 Median :7.0 Median :259.3 Mode :character
#> Mean :17.61 Mean :6.4 Mean :283.4
#> 3rd Qu.:20.85 3rd Qu.:8.0 3rd Qu.:430.0
#> Max. :24.40 Max. :8.0 Max. :472.0
#> GRP_B
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
#>
#> $B.1
#> mpg cyl disp GRP_A
#> Min. :15.00 Min. :4.0 Min. : 75.7 Length:5
#> 1st Qu.:21.00 1st Qu.:4.0 1st Qu.: 78.7 Class :character
#> Median :21.50 Median :4.0 Median :120.1 Mode :character
#> Mean :24.06 Mean :5.2 Mean :147.1
#> 3rd Qu.:30.40 3rd Qu.:6.0 3rd Qu.:160.0
#> Max. :32.40 Max. :8.0 Max. :301.0
#> GRP_B
#> Min. :1
#> 1st Qu.:1
#> Median :1
#> Mean :1
#> 3rd Qu.:1
#> Max. :1
#>
#> $A.2
#> mpg cyl disp GRP_A
#> Min. :15.20 Min. :4.000 Min. : 95.1 Length:9
#> 1st Qu.:16.40 1st Qu.:6.000 1st Qu.:160.0 Class :character
#> Median :18.10 Median :8.000 Median :275.8 Mode :character
#> Mean :19.84 Mean :6.667 Mean :234.0
#> 3rd Qu.:21.00 3rd Qu.:8.000 3rd Qu.:275.8
#> Max. :30.40 Max. :8.000 Max. :360.0
#> GRP_B
#> Min. :2
#> 1st Qu.:2
#> Median :2
#> Mean :2
#> 3rd Qu.:2
#> Max. :2
#>
#> $B.2
#> mpg cyl disp GRP_A
#> Min. :13.30 Min. :4 Min. : 71.1 Length:8
#> 1st Qu.:14.97 1st Qu.:4 1st Qu.:125.3 Class :character
#> Median :20.55 Median :6 Median :201.5 Mode :character
#> Mean :20.99 Mean :6 Mean :213.5
#> 3rd Qu.:23.93 3rd Qu.:8 3rd Qu.:315.5
#> Max. :33.90 Max. :8 Max. :360.0
#> GRP_B
#> Min. :2
#> 1st Qu.:2
#> Median :2
#> Mean :2
#> 3rd Qu.:2
#> Max. :2
How can I achieve this same result using a tidyverse
verb? My initial thought was to use purrr::by_slice()
, but apparently that has been deprecated.
To split a column into multiple columns in the R Language, we use the separator() function of the dplyr package library. The separate() function separates a character column into multiple columns with a regular expression or numeric locations.
The unsplit() method has the following syntax. Use the split() function in R to split a vector or data frame. Use the unsplit() method to retrieve the split vector or data frame.
dplyr 0.8.0 has introduced the verb that you were looking for: group_split()
From the documentation:
group_split
() works likebase::split()
but
it uses the grouping structure from group_by() and therefore is subject to the data mask
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
group_keys()
explains the grouping structure, by returning a data frame that has one row per group and one column per grouping variable.
For your example:
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
group_split(GRP_A, GRP_B) %>%
map(summary)
EDIT: this answer is now outdated. See @MartijnVanAttekum's solution above.
The "tidy" solution seems to be a combination of "mutate + list-cols + purrr" according to Hadley.
library(tidyverse)
library(magrittr)
# group, nest, create a new col leveraging purrr::map()
mt_summary <-
mtcars %>%
select(1:3) %>%
mutate(GRP_A = sample(LETTERS[1:2], n(), replace = TRUE),
GRP_B = sample(c(1:2), n(), replace = TRUE)) %>%
group_by(GRP_A, GRP_B) %>%
nest() %>%
mutate(SUMMARY = map(data, .f = summary))
# check the structure
mt_summary
#> # A tibble: 4 × 4
#> GRP_A GRP_B data SUMMARY
#> <chr> <int> <list> <list>
#> 1 A 1 <tibble [11 × 3]> <S3: table>
#> 2 B 2 <tibble [9 × 3]> <S3: table>
#> 3 A 2 <tibble [7 × 3]> <S3: table>
#> 4 B 1 <tibble [5 × 3]> <S3: table>
# extract the summaries
extract2(mt_summary, "SUMMARY") %>%
set_names(paste0(extract2(mt_summary, "GRP_A"),
extract2(mt_summary, "GRP_B")))
#> $A1
#> mpg cyl disp
#> Min. :10.40 Min. :4.000 Min. : 75.7
#> 1st Qu.:15.25 1st Qu.:4.000 1st Qu.:120.9
#> Median :19.20 Median :6.000 Median :167.6
#> Mean :20.43 Mean :6.182 Mean :229.0
#> 3rd Qu.:25.85 3rd Qu.:8.000 3rd Qu.:309.5
#> Max. :30.40 Max. :8.000 Max. :460.0
#>
#> $B2
#> mpg cyl disp
#> Min. :15.20 Min. :4.000 Min. : 78.7
#> 1st Qu.:17.80 1st Qu.:4.000 1st Qu.:120.3
#> Median :19.20 Median :6.000 Median :167.6
#> Mean :20.84 Mean :6.222 Mean :225.9
#> 3rd Qu.:21.50 3rd Qu.:8.000 3rd Qu.:351.0
#> Max. :32.40 Max. :8.000 Max. :400.0
#>
#> $A2
#> mpg cyl disp
#> Min. :15.20 Min. :4.000 Min. : 71.1
#> 1st Qu.:18.90 1st Qu.:4.000 1st Qu.:114.5
#> Median :21.40 Median :6.000 Median :145.0
#> Mean :21.79 Mean :5.429 Mean :176.0
#> 3rd Qu.:22.10 3rd Qu.:6.000 3rd Qu.:241.5
#> Max. :33.90 Max. :8.000 Max. :304.0
#>
#> $B1
#> mpg cyl disp
#> Min. :10.40 Min. :4.0 Min. :140.8
#> 1st Qu.:13.30 1st Qu.:8.0 1st Qu.:275.8
#> Median :14.30 Median :8.0 Median :350.0
#> Mean :15.62 Mean :7.2 Mean :319.7
#> 3rd Qu.:17.30 3rd Qu.:8.0 3rd Qu.:360.0
#> Max. :22.80 Max. :8.0 Max. :472.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With