I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).
For example, why when I try to mutate() something does the "group_by" function not work as it's supposed to?
Looking at mtcars:
library(dplyr)  # mtcars itself ships with base R's datasets package
Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":
df1 <- mtcars %>% group_by(cyl, gear) %>% summarise( newvar = sum(wt) )
Then say I want to further summarise this data frame. With ddply it would be straightforward, but when I try to do this with dplyr, it's not actually "grouping by":
df2 <- df1 %>% group_by(cyl) %>% mutate( newvar2 = newvar + 5 )
Still yields an ungrouped output:
  cyl gear newvar newvar2
1   6    3  6.675  11.675
2   4    4 19.025  24.025
3   6    4 12.375  17.375
4   6    5  2.770   7.770
5   4    3  2.465   7.465
6   8    3 49.249  54.249
7   4    5  3.653   8.653
8   8    5  6.740  11.740
Am I doing something wrong with the syntax?
Edit:
If I were to do this with plyr and ddply:
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
and then to get the second df:
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...
The group_by() function groups the rows of a data frame by the columns passed as arguments to the call.
Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group".
%>% is the forward pipe operator in R. It forwards a value, or the result of an expression, into the next function call as its first argument, which lets you chain commands from left to right. It is defined by the package magrittr (CRAN) and is heavily used by dplyr (CRAN).
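As a minimal sketch of what the pipe does, assuming the %>% that dplyr re-exports from magrittr: the left-hand side becomes the first argument of the call on the right, so a piped chain and the equivalent nested call should produce the same result. The column name total_wt is made up for illustration.

```r
library(dplyr)  # re-exports %>% from magrittr

# Piped form: read left to right.
piped <- mtcars %>% group_by(cyl) %>% summarise(total_wt = sum(wt))

# Nested form: the same calls, written inside out.
nested <- summarise(group_by(mtcars, cyl), total_wt = sum(wt))

# Both forms compute the same per-cyl totals.
identical(piped$total_wt, nested$total_wt)
```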
group_by() in R: the dplyr package provides group_by(), which groups a data frame by one or more columns so that functions such as mean(), sum(), n(), max() and min() can then be computed per group.
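The difference between mutate() and summarise() on a grouped tbl is what the question runs into, so a short sketch may help. The names by_cyl, group_total and total are illustrative only. Note that an expression like newvar + 5 is computed row by row, so grouping cannot change its result; only aggregates such as sum() differ per group.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# mutate() keeps every row; a grouped aggregate such as sum(wt)
# is computed per group and repeated on each row of that group.
with_totals <- by_cyl %>% mutate(group_total = sum(wt))

# summarise() collapses each group to a single row.
totals <- by_cyl %>% summarise(total = sum(wt))

nrow(with_totals)  # same number of rows as mtcars
nrow(totals)       # one row per distinct value of cyl
```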
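I had a similar problem. When plyr is attached after dplyr, plyr's summarise() and mutate() mask the dplyr versions and ignore the grouping. I found that simply detaching plyr and reloading dplyr solved it:
detach(package:plyr)
library(dplyr)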
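If both packages have to stay attached, one possible alternative (a sketch, not something the answer above proposes) is to qualify the verbs with dplyr:: so the dplyr versions run regardless of which package was attached last:

```r
library(dplyr)

# Explicit dplyr:: prefixes sidestep any masking by plyr.
df1 <- mtcars %>%
  dplyr::group_by(cyl, gear) %>%
  dplyr::summarise(newvar = sum(wt))

df2 <- df1 %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarise(newvar2 = sum(newvar) + 5)

df2  # one row per value of cyl
```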
Taking Dickoa's answer one step further: as Hadley says, "summarise peels off a single layer of grouping". Grouping is peeled off in the reverse of the order in which you applied it, so you can just use
mtcars %>% group_by(cyl, gear) %>% summarise(newvar = sum(wt)) %>% summarise(newvar2 = sum(newvar) + 5)
Note that this will give a different answer if you use group_by(gear, cyl) instead, because the last grouping variable is the one that gets peeled off.
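To see which layer is left after one summarise(), you can inspect the grouping with group_vars(). This is a small sketch that assumes a recent dplyr, where summarise() drops the last grouping variable by default; step1 and step1_rev are illustrative names.

```r
library(dplyr)

step1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt))

# The last grouping variable (gear) has been peeled off;
# the result is still grouped by cyl.
group_vars(step1)

# With the order reversed, cyl is peeled off and gear remains.
step1_rev <- mtcars %>%
  group_by(gear, cyl) %>%
  summarise(newvar = sum(wt))

group_vars(step1_rev)
```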
And to get your first attempt working:
df1 <- mtcars %>% group_by(cyl, gear) %>% summarise(newvar = sum(wt))
df2 <- df1 %>% group_by(cyl) %>% summarise(newvar2 = sum(newvar) + 5)