
How can dplyr generate data frame for each group after the group_by operation?

Tags:

r

dplyr

I have been very impressed by how smoothly the dplyr package handles pipeline-style data processing. Recently I ran into a problem: I need to generate a new data frame for each group ID and then combine those small data frames into one larger data frame. A toy example:

input.data.frame %>%
    group_by(gid) %>%
    {some operation to generate a new data frame for each group} ## FAILED!!!!

In dplyr, mutate adds new columns within each group and summarise generates one summary per group; neither fulfills my requirement. (Did I miss something?)

Alternatively, with ddply from the plyr package, the predecessor of dplyr, I can do it via

ddply(input.data.frame, .(gid), function(x) {
     ## some operation to generate a new data frame for each group
})

But the drawback is that some dplyr functions get masked when I load the plyr package.

caesar0301 asked Nov 07 '14 08:11


2 Answers

Here is an example following the answer by G. Grothendieck to a similar question, "Adding rows in `dplyr` output".

First we generate a data frame with columns x and g: x holds 9 random numbers and g holds 3 groups a, b, c. We want to select the 2 largest numbers from each group. It is important to remember that do requires a data frame as its return value.

library(dplyr)
set.seed(1)
dat <- data.frame(x=runif(9),g=rep(letters[1:3],each=3))

dat
      x g
1 0.1765568 a
2 0.6870228 a
3 0.3841037 a
4 0.7698414 b
5 0.4976992 b
6 0.7176185 b
7 0.9919061 c
8 0.3800352 c
9 0.7774452 c

## this works
dat %>% dplyr::group_by( g ) %>% do( data.frame(x=tail(sort(.$x),2)) )

## this works too
dat %>% dplyr::group_by( g ) %>% do( .[tail(order(.$x),2),] )

          x      g
      (dbl) (fctr)
1 0.3841037      a
2 0.6870228      a
3 0.7176185      b
4 0.7698414      b
5 0.7774452      c
6 0.9919061      c

## no error, but the named argument makes do() store each group's result in a list column x instead of expanding it into rows
dat %>% dplyr::group_by( g ) %>% do( x=tail(sort(.$x),2) )
       g        x
  (fctr)    (chr)
1      a <dbl[2]>
2      b <dbl[2]>
3      c <dbl[2]>

## for more complicated operations, define a function
top2x <- function(df) { df[tail(order(df$x),2),] }
dat %>% dplyr::group_by( g ) %>% do( top2x(.) )
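
For readers on a newer dplyr, here is a minimal sketch of the same "one data frame per group" pattern using group_modify, which I believe is available from dplyr 0.8.1 onward. The helper range_df and the columns stat and value are hypothetical names made up for this illustration; the point is only that the returned data frame can have a different shape from the input group.

```r
library(dplyr)

# Toy data with the same shape as above
set.seed(1)
dat <- data.frame(x = runif(9), g = rep(letters[1:3], each = 3))

# Hypothetical helper: df is the rows of one group (without the grouping
# column), key is a one-row tibble holding that group's value of g.
# It must return a data frame; here it returns two rows per group.
range_df <- function(df, key) {
  data.frame(stat = c("min", "max"), value = range(df$x))
}

# group_modify() calls range_df once per group and row-binds the results,
# putting the grouping column g back in front
res <- dat %>%
  group_by(g) %>%
  group_modify(range_df)
```

The result has two rows per group and the columns g, stat and value, so it plays the same role as the do( top2x(.) ) call above.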
YH Wu answered Oct 20 '22 04:10


Turning my comment into an answer.

Yes, dplyr offers a way to create data.frames for each group: use the do operator on a grouped data.frame / tbl. More precisely, do lets you apply arbitrary functions to each group. This is documented in the help file for do:

[...] You can use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list. This is particularly useful when working with models: you can fit models per group with do and then flexibly extract components with either another do or summarise.
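
As a rough sketch of the per-group modelling workflow the help text describes, the following uses the built-in mtcars data. The choice of formula and the names models, mod, fit_stats and rsq are only for illustration.

```r
library(dplyr)

# Fit one linear model per value of cyl. Because the argument is named (mod),
# do() stores each fitted model in a list column instead of expecting a data frame.
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ wt, data = .))

# The result is processed row by row, so summarise() sees one model at a time
# and can extract a component from it.
fit_stats <- models %>%
  summarise(rsq = summary(mod)$r.squared)
```

Here models has one row per group with the fitted lm objects in the mod column, and fit_stats has one R-squared value per group.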

My experience so far is that whenever one of the specialised dplyr functions such as mutate, summarise, or mutate_each can do the job, it should be preferred over do: those functions are often more efficient, though of course not as flexible.
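
To make that trade-off concrete, here is a small sketch that computes the same per-group mean both ways. The data and the column name mean_x are invented for the example, and no timing is shown; it only illustrates that the two forms give the same values.

```r
library(dplyr)

# Small made-up data set with two groups
dat <- data.frame(x = c(1, 2, 3, 10, 20, 30),
                  g = rep(c("a", "b"), each = 3))

# Specialised verb: short, and the preferred form when it fits
via_summarise <- dat %>%
  group_by(g) %>%
  summarise(mean_x = mean(x))

# General-purpose do(): same values, but each group goes through an
# arbitrary expression that has to build a data frame itself
via_do <- dat %>%
  group_by(g) %>%
  do(data.frame(mean_x = mean(.$x)))
```

Both results contain one row per group with the same mean_x values, so do is only worth reaching for when the per-group result does not fit what mutate or summarise can return.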

talat answered Oct 20 '22 04:10