I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is a preferred way anyway), but I am unable to persist the resulting grouped_df as a list of data frames, the format required by my subsequent processing steps (I need to coerce to SpatialDataFrames and similar).
Consider a sample dataset:

df = as.data.frame(cbind(c("a","a","b","b","c"), c(1,2,3,4,5), c(2,3,4,2,2)))
listDf = split(df, df$V1)
returns
$a
  V1 V2 V3
1  a  1  2
2  a  2  3

$b
  V1 V2 V3
3  b  3  4
4  b  4  2

$c
  V1 V2 V3
5  c  5  2
I would like to emulate this with group_by (something like group_by(df, V1)), but that returns a single grouped_df. I know that do should be able to help me, but I am unsure about its usage (see the link for a discussion).
Note that split names each list element after the level of the factor that was used to establish the group; this is desired behaviour (ultimately, bonus kudos for a way to extract these names from the list of dfs).
group_split in dplyr:
dplyr has implemented group_split: https://dplyr.tidyverse.org/reference/group_split.html
It splits a data frame by groups and returns a list of data frames. Each of these data frames is a subset of the original, defined by the categories of the splitting variable.
For example, split the dataset iris by the variable Species and calculate a summary of each sub-dataset:
> iris %>%
+   group_split(Species) %>%
+   map(summary)
[[1]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600

[[2]]
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800

[[3]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
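Regarding the bonus question: unlike split(), group_split() returns an unnamed list, but group_keys() returns the grouping values, one row per group and in the same order, so the names can be reattached. A minimal sketch with the sample data from the question (built with data.frame() here so that V2 and V3 stay numeric, and assuming a dplyr version that provides group_split() and group_keys(), i.e. >= 0.8.0):

library(dplyr)

df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

grouped <- df %>% group_by(V1)

# group_split() returns an unnamed list with one tibble per group
listDf <- grouped %>% group_split()

# group_keys() yields one row per group, in group order, so the
# grouping values can be reattached as list names, as split() would set them
names(listDf) <- grouped %>% group_keys() %>% pull(V1)

After this, listDf[["a"]], listDf[["b"]] and listDf[["c"]] behave just like the correspondingly named elements of split(df, df$V1).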
It is also very helpful for debugging calculations on nested data frames, because it is a quick way to "see" what is going on "inside" them.
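For instance, a per-group step that fails somewhere inside a longer pipeline can be re-run on a single group in isolation. A small sketch (the summarise() call is just a stand-in for whatever per-group computation is being debugged):

library(dplyr)

pieces <- iris %>% group_split(Species)

# Inspect one sub-dataframe directly
glimpse(pieces[[1]])

# Re-run the per-group step on that piece alone
pieces[[1]] %>% summarise(mean_petal = mean(Petal.Length))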