I need to summarize a data frame by some variables, ignoring the others. This is sometimes referred to as collapsing. E.g. if I have a dataframe like this:
Widget Type Energy
egg 1 20
egg 2 30
jap 3 50
jap 1 60
Then collapsing by Widget, with Energy the dependent variable, Energy~Widget, would yield
Widget Energy
egg 25
jap 55
In Excel the closest functionality might be "Pivot tables" and I've worked out how to do it in python ( http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/), and here's an example with R using doBy library to do something very related ( http://www.mail-archive.com/[email protected]/msg02643.html), but is there an easy way to do the above? And even better is there anything built into the ggplot2 library to create plots that collapse across some variables?
Collapsing means using one or several grouping variables to find summary statistics — mean, median, etc. — for different categories in your data.
Group By Sum in R using dplyr You can use group_by() function along with the summarise() from dplyr package to find the group by sum in R DataFrame, group_by() returns the grouped_df ( A grouped Data Frame) and use summarise() on grouped df results to get the group by sum.
collapse takes the dataset in memory and creates a new dataset containing summary statistics of the original data. collapse adds meaningful variable labels to the variables in this new dataset.
Use aggregate
to summarize across a factor:
> df<-read.table(textConnection('
+ egg 1 20
+ egg 2 30
+ jap 3 50
+ jap 1 60'))
> aggregate(df$V3,list(df$V1),mean)
Group.1 x
1 egg 25
2 jap 55
For more flexibility look at the tapply
function and the plyr
package.
In ggplot2
use stat_summary
to summarize
qplot(V1,V3,data=df,stat="summary",fun.y=mean,geom='bar',width=0.4)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With