I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is a preferred way anyway), but I am unable to persist the resulting grouped_df as a list of data frames, the format required by my subsequent processing steps (I need to coerce to SpatialDataFrames and similar).
Consider a sample dataset:

df = as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))
listDf = split(df, df$V1)
returns
$a
  V1 V2 V3
1  a  1  2
2  a  2  3

$b
  V1 V2 V3
3  b  3  4
4  b  4  2

$c
  V1 V2 V3
5  c  5  2
I would like to emulate this with group_by (something like group_by(df, V1)), but that returns a single grouped_df. I know that do should be able to help me, but I am unsure about its usage (see the link for a discussion).
Note that split names each list element after the level of the factor that was used to establish the group; this is desired behaviour (ultimately, bonus kudos for a way to extract these names from the list of dfs).
group_split in dplyr:
dplyr has implemented group_split: https://dplyr.tidyverse.org/reference/group_split.html
It splits a data frame by groups and returns a list of data frames. Each of these data frames is a subset of the original, defined by the categories of the splitting variable.
For example, split the dataset iris by the variable Species and calculate a summary of each sub-dataset:
> iris %>%
+   group_split(Species) %>%
+   map(summary)
[[1]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600

[[2]]
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800

[[3]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
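Regarding the bonus question: unlike split(), group_split() returns an unnamed list, but group_keys() returns the grouping values, one row per group and in the same order, so the names can be reattached. A minimal sketch with the sample data from the question (built with data.frame() here so that V2 and V3 stay numeric, and assuming a dplyr version that provides group_split() and group_keys(), i.e. >= 0.8.0):

library(dplyr)

df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

grouped <- df %>% group_by(V1)

# group_split() returns an unnamed list with one tibble per group
listDf <- grouped %>% group_split()

# group_keys() yields one row per group, in group order, so the
# grouping values can be reattached as list names, as split() would set them
names(listDf) <- grouped %>% group_keys() %>% pull(V1)

After this, listDf[["a"]], listDf[["b"]] and listDf[["c"]] behave just like the correspondingly named elements of split(df, df$V1).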
It is also very helpful for debugging calculations on nested data frames, because it is a quick way to "see" what is going on "inside" them.
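For instance, a per-group step that fails somewhere inside a longer pipeline can be re-run on a single group in isolation. A small sketch (the summarise() call is just a stand-in for whatever per-group computation is being debugged):

library(dplyr)

pieces <- iris %>% group_split(Species)

# Inspect one sub-dataframe directly
glimpse(pieces[[1]])

# Re-run the per-group step on that piece alone
pieces[[1]] %>% summarise(mean_petal = mean(Petal.Length))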