Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I collapse a dataframe by some variables, taking mean across others

I need to summarize a data frame by some variables, ignoring the others. This is sometimes referred to as collapsing. E.g. if I have a dataframe like this:

Widget Type Energy  
egg 1 20  
egg 2 30  
jap 3 50  
jap 1 60

Then collapsing by Widget, with Energy the dependent variable, Energy~Widget, would yield

Widget Energy  
egg  25  
jap  55  

In Excel the closest functionality might be "Pivot tables" and I've worked out how to do it in python ( http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/), and here's an example with R using doBy library to do something very related ( http://www.mail-archive.com/[email protected]/msg02643.html), but is there an easy way to do the above? And even better is there anything built into the ggplot2 library to create plots that collapse across some variables?

like image 505
Alex Holcombe Avatar asked Apr 01 '10 04:04

Alex Holcombe


People also ask

What does it mean to collapse across a variable?

Collapsing means using one or several grouping variables to find summary statistics — mean, median, etc. — for different categories in your data.

How do I group by and sum in R?

Group By Sum in R using dplyr You can use group_by() function along with the summarise() from dplyr package to find the group by sum in R DataFrame, group_by() returns the grouped_df ( A grouped Data Frame) and use summarise() on grouped df results to get the group by sum.

How does collapse work in Stata?

collapse takes the dataset in memory and creates a new dataset containing summary statistics of the original data. collapse adds meaningful variable labels to the variables in this new dataset.


1 Answers

Use aggregate to summarize across a factor:

> df<-read.table(textConnection('
+ egg 1 20
+ egg 2 30
+ jap 3 50
+ jap 1 60'))
> aggregate(df$V3,list(df$V1),mean)
  Group.1  x
1     egg 25
2     jap 55

For more flexibility look at the tapply function and the plyr package.

In ggplot2 use stat_summary to summarize

qplot(V1,V3,data=df,stat="summary",fun.y=mean,geom='bar',width=0.4)
like image 134
Jyotirmoy Bhattacharya Avatar answered Nov 03 '22 23:11

Jyotirmoy Bhattacharya