I have a data frame of the form:
  Family Code Length Type
1      A    1     11 Alpha
2      A    3      8 Beta
3      A    3      9 Beta
4      B    4      7 Alpha
5      B    5      8 Alpha
6      C    6      2 Beta
7      C    6      5 Beta
8      C    6      4 Beta
I would like to reduce the data set to one containing unique values of Code by taking a mean of Length values, but to retain all string variables too, i.e.
  Family Code Length Type
1      A    1     11 Alpha
2      A    3    8.5 Beta
3      B    4      7 Alpha
5      B    5      8 Alpha
6      C    6   3.67 Beta
I've tried aggregate() and ddply() but these seem to replace strings with NA and I'm struggling to find a way round this.
The aggregate() function computes summary statistics of the data by group, such as the mean, minimum, or sum.
Since Family and Type are constant within a Code group, you can group on those as well without changing the result when you use ddply. If your original data set is dat, then
library(plyr)
ddply(dat, .(Family, Code, Type), summarize, Length=mean(Length))
gives
  Family Code  Type    Length
1      A    1 Alpha 11.000000
2      A    3  Beta  8.500000
3      B    4 Alpha  7.000000
4      B    5 Alpha  8.000000
5      C    6  Beta  3.666667
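The same idea of grouping on the constant columns should also work with base R's aggregate(), which the question mentions. This is a minimal sketch, assuming the data frame is rebuilt by hand as dat from the values in the question; the final ordering step is only there to make the rows easier to compare with the ddply result.

```r
# Rebuild the example data from the question (typed in by hand here).
dat <- data.frame(
  Family = c("A", "A", "A", "B", "B", "C", "C", "C"),
  Code   = c(1, 3, 3, 4, 5, 6, 6, 6),
  Length = c(11, 8, 9, 7, 8, 2, 5, 4),
  Type   = c("Alpha", "Beta", "Beta", "Alpha", "Alpha", "Beta", "Beta", "Beta"),
  stringsAsFactors = FALSE
)

# Group on Family, Code and Type together, and average Length within each group.
res <- aggregate(Length ~ Family + Code + Type, data = dat, FUN = mean)

# Sort by Code so the rows appear in the same order as the original data.
res <- res[order(res$Code), ]
print(res)
```

Because the string columns are on the grouping side of the formula, they are carried through as labels rather than being passed to mean().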
If Family and Type are not constant within a Code group, then you need to define how to summarize those values. In this example, I just take the single unique value, which assumes each Code group has exactly one:
ddply(dat, .(Code), summarize, Family=unique(Family), 
  Length=mean(Length), Type=unique(Type))
Similar options using dplyr are
 library(dplyr)
 dat %>% 
     group_by(Family, Code, Type) %>%
     summarise(Length=mean(Length))
and
  dat %>%
     group_by(Code) %>%
     summarise(Family=unique(Family), Length=mean(Length), Type=unique(Type))
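If a string column really does vary within a Code group, unique() would return more than one value per group, so a different summary has to be chosen. The following is only a sketch of one possible choice: dat2 is a hypothetical data set made up for illustration, and taking the first Family and joining the distinct Type values with "/" are assumptions, not something the question specifies.

```r
library(dplyr)

# Hypothetical data where Type is NOT constant within Code 1.
dat2 <- data.frame(
  Family = c("A", "A", "B"),
  Code   = c(1, 1, 2),
  Length = c(10, 20, 5),
  Type   = c("Alpha", "Beta", "Alpha"),
  stringsAsFactors = FALSE
)

res2 <- dat2 %>%
  group_by(Code) %>%
  summarise(
    Family = first(Family),                        # keep the first value seen
    Length = mean(Length),                         # average the numeric column
    Type   = paste(unique(Type), collapse = "/")   # join the distinct values
  )
print(res2)
```

Any other rule that returns one value per group, such as the most frequent value, would slot into summarise() in the same way.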