
How to extract one specific group in dplyr

Tags:

r

group-by

dplyr

Given a grouped tbl, can I extract one or a few groups? Such a function would be useful when prototyping code, e.g.:

mtcars %>%
  group_by(cyl) %>%
  select_first_n_groups(2) %>%
  do({'complicated expression'})

Surely, one can do an explicit filter before grouping, but that can be cumbersome.
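For comparison, the explicit-filter workaround might look like the sketch below. The names `keep` and `counts` are made up for illustration, and `summarise()` stands in for the complicated expression.

```r
library(dplyr)

# Pick the group values by hand before grouping (here: the two smallest cyl values).
keep = head(sort(unique(mtcars$cyl)), 2)

counts = mtcars %>%
    filter(cyl %in% keep) %>%   # explicit filter on the grouping column
    group_by(cyl) %>%
    summarise(n = n())          # placeholder for the real per-group work
```

The grouping column has to be named twice (once in `filter()`, once in `group_by()`), which is the cumbersome part.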

Rosen Matev asked Oct 22 '14 08:10

1 Answer

With a bit of dplyr along with some nesting/unnesting (supported by the tidyr package), you could define a small helper to get the first (or any) group:

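library(dplyr); library(tidyr)
# nest() collapses each group into one row, slice(1) keeps the first one (note: this masks dplyr::first())
first = function(x) x %>% nest %>% ungroup %>% slice(1) %>% unnest(data)
mtcars %>% group_by(cyl) %>% first()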

By adjusting the slicing you could also extract the nth or any range of groups by index, but typically the first or the last is what most users want.
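As a rough sketch of that adjustment, the same nest/slice/unnest pattern can take the positions as an argument. The helper names `nth_groups` and `last_group` are hypothetical, not part of dplyr or tidyr, and this assumes a tidyr version where `nest()` on a grouped tbl produces a `data` column (1.0 or later).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: keep the groups at the positions given in idx.
nth_groups = function(x, idx) x %>% nest() %>% ungroup() %>% slice(idx) %>% unnest(data)

# Hypothetical helper: keep only the last group, using n() inside slice().
last_group = function(x) x %>% nest() %>% ungroup() %>% slice(n()) %>% unnest(data)

two_groups = mtcars %>% group_by(cyl) %>% nth_groups(2:3)
one_group = mtcars %>% group_by(cyl) %>% last_group()
```

Positions refer to the order of the rows that `nest()` produces, so it is worth checking which group ends up first before relying on a particular index.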

The name is inspired by functional APIs, many of which call it first (see the standard libraries of e.g. Kotlin, Python, Scala, Java, Spark).

Edit: Faster Version

A more scalable version (>50x faster on large datasets) that avoids nesting would be:

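library(dplyr)
first_group = function(x) x %>%
    select(group_cols()) %>%   # keep only the grouping columns
    distinct %>%               # one row per group
    ungroup %>%
    slice(1) %>%               # key of the first group
    { semi_join(x, .) }        # rows of x that match that key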

Another positive side effect of this improved version is that it fails if no grouping is present in x.

Holger Brandl answered Sep 22 '22 14:09