I have been very impressed by how smoothly the dplyr package handles pipeline-style data processing. Recently I ran into a problem: I want to generate a new data frame for each group ID and then combine those small data frames into one larger data frame. A toy example:
input.data.frame %>%
group_by(gid) %>%
{some operation to generate a new data frame for each group} ## FAILED!!!!
In dplyr, neither mutate (which adds new columns to each group) nor summarise (which generates summaries for each group) fulfills my requirement. (Did I miss something?)
Alternatively, using ddply from the plyr package, the previous iteration of dplyr, I can do it via
ddply(input.data.frame, .(gid), function(x) {
some operation to generate a new data frame for each group
})
The drawback is that some dplyr functions get masked when I load the plyr package.
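One possible way around the masking, sketched here under the assumption that plyr is installed, is to call ddply through its namespace instead of attaching plyr with library(). The data frame and the per-group function below are hypothetical stand-ins for the real operation:

```r
library(dplyr)  # plyr is deliberately NOT attached, so nothing in dplyr gets masked

# Hypothetical toy input: two groups with three values each
input.data.frame <- data.frame(
  x   = c(3, 1, 2, 6, 5, 4),
  gid = rep(c("a", "b"), each = 3)
)

# Call ddply via the plyr:: prefix; the function returns one small
# data frame per group (here: the row with the largest x)
res <- plyr::ddply(input.data.frame, "gid", function(d) {
  d[which.max(d$x), , drop = FALSE]
})
res
```

Because plyr is never attached, dplyr verbs such as summarise and mutate keep resolving to the dplyr versions.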
Here is an example following the answer by G. Grothendieck to a similar question, "Adding rows in `dplyr` output".
First we generate a data frame with columns x and g. There are 9 random numbers in x and 3 groups (a, b, c) in g. We want to select the 2 largest numbers from each group. It is important to remember that do requires a data frame as its return value.
library(dplyr)
set.seed(1)
dat <- data.frame(x=runif(9),g=rep(letters[1:3],each=3))
dat
x g
1 0.1765568 a
2 0.6870228 a
3 0.3841037 a
4 0.7698414 b
5 0.4976992 b
6 0.7176185 b
7 0.9919061 c
8 0.3800352 c
9 0.7774452 c
## this works
dat %>% dplyr::group_by( g ) %>% do( data.frame(x=tail(sort(.$x),2)) )
## this works too
dat %>% dplyr::group_by( g ) %>% do( .[tail(order(.$x),2),] )
x g
(dbl) (fctr)
1 0.3841037 a
2 0.6870228 a
3 0.7176185 b
4 0.7698414 b
5 0.7774452 c
6 0.9919061 c
## no error, but each group's result is stored as one element of a list column x instead of being expanded into rows
dat %>% dplyr::group_by( g ) %>% do( x=tail(sort(.$x),2) )
g x
(fctr) (chr)
1 a <dbl[2]>
2 b <dbl[2]>
3 c <dbl[2]>
## you need a function to do more complicated stuff
top2x <- function(df) { df[tail(order(df$x),2),] }
dat %>% dplyr::group_by( g ) %>% do( top2x(.) )
Turning my comment into an answer..
Yes, dplyr offers a way to create data.frames for each group. Using the do operator on a grouped data.frame / tbl will let you do this; more precisely, it lets you apply arbitrary functions to each group. This is documented in the help file for do:
[...] You can use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list. This is particularly useful when working with models: you can fit models per group with do and then flexibly extract components with either another do or summarise.
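The model-fitting workflow described in that help text can be sketched as follows. This is a minimal illustration using the built-in mtcars data; the model formula and the column names mod and rsq are chosen here only for the example:

```r
library(dplyr)

# Fit one linear model per group; the named argument makes do()
# store each fitted model as an element of the list column `mod`
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, data = .))

# Extract one number per model with summarise
rsq <- models %>%
  summarise(rsq = summary(mod)$r.squared)
rsq
```

The first step returns one row per group with the fitted object kept intact, so further components can be pulled out later without refitting.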
My experience so far is that whenever it is possible to use one of the specialised dplyr functions like mutate / summarise / mutate_each / etc., they should be preferred over do, because they are often more efficient than do, though of course not as flexible.
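As an illustration of that trade-off, the "two largest values per group" task can be written both ways. This sketch assumes dplyr 1.0.0 or later, where slice_max is available; the variable names via_do and via_slice are just for the example:

```r
library(dplyr)

set.seed(1)
dat <- data.frame(x = runif(9), g = rep(letters[1:3], each = 3))

# General approach: do() with an arbitrary expression per group
via_do <- dat %>%
  group_by(g) %>%
  do(.[tail(order(.$x), 2), ]) %>%
  ungroup()

# Specialised verb: slice_max() keeps the n largest rows per group
via_slice <- dat %>%
  group_by(g) %>%
  slice_max(x, n = 2) %>%
  ungroup()

# Both select the same values; only the row order within groups differs
isTRUE(all.equal(sort(via_do$x), sort(via_slice$x)))
```

The do version accepts any expression that returns a data frame, while the specialised verb is shorter and states the intent directly.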