I have a data frame of the form:
  Family Code Length  Type
1      A    1     11 Alpha
2      A    3      8  Beta
3      A    3      9  Beta
4      B    4      7 Alpha
5      B    5      8 Alpha
6      C    6      2  Beta
7      C    6      5  Beta
8      C    6      4  Beta
I would like to reduce the data set to one containing unique values of Code by taking a mean of Length values, but to retain all string variables too, i.e.
  Family Code Length  Type
1      A    1     11 Alpha
2      A    3    8.5  Beta
3      B    4      7 Alpha
5      B    5      8 Alpha
6      C    6   3.67  Beta
I've tried aggregate() and ddply() but these seem to replace strings with NA and I'm struggling to find a way round this.
The aggregate() function computes summary statistics, such as the mean, minimum, or sum, for each group in the data.
Since Family and Type are constant within a Code group, you can "group" on those as well without changing anything when you use ddply. If your original data set was dat, then
library(plyr)
ddply(dat, .(Family, Code, Type), summarize, Length=mean(Length))
gives
  Family Code  Type    Length
1      A    1 Alpha 11.000000
2      A    3  Beta  8.500000
3      B    4 Alpha  7.000000
4      B    5 Alpha  8.000000
5      C    6  Beta  3.666667
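For a version without plyr, the same grouping idea can be sketched with base R's aggregate() and its formula interface. The data frame dat below is rebuilt from the question's table so the snippet runs on its own. Grouping on all three columns relies on the assumption that Family and Type are constant within each Code.

```r
# Rebuild the example data from the question (assumed to be called `dat`)
dat <- data.frame(
  Family = c("A", "A", "A", "B", "B", "C", "C", "C"),
  Code   = c(1, 3, 3, 4, 5, 6, 6, 6),
  Length = c(11, 8, 9, 7, 8, 2, 5, 4),
  Type   = c("Alpha", "Beta", "Beta", "Alpha", "Alpha", "Beta", "Beta", "Beta"),
  stringsAsFactors = FALSE
)

# Mean of Length for every Family/Code/Type combination.
# The string columns sit on the right-hand side of the formula,
# so they are kept as grouping keys instead of being averaged.
res <- aggregate(Length ~ Family + Code + Type, data = dat, FUN = mean)

# aggregate() does not return rows in Code order, so sort explicitly
res <- res[order(res$Code), ]
print(res)
```

Because the string columns are used as grouping keys rather than passed to mean(), they are not turned into NA.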
If Family and Type are not constant within a Code group, then you would need to define how to summarize/aggregate those values. In this example, I just take the single unique value:
ddply(dat, .(Code), summarize, Family=unique(Family),
Length=mean(Length), Type=unique(Type))
Similar options using dplyr are
library(dplyr)
dat %>%
group_by(Family, Code, Type) %>%
summarise(Length=mean(Length))
and
dat %>%
group_by(Code) %>%
summarise(Family=unique(Family), Length=mean(Length), Type=unique(Type))
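If the string columns really do vary within a Code group, unique() returns more than one value per group, so some rule for collapsing them is needed. One possible rule, shown here only as a sketch, is to join the distinct values into a single string with toString(). The data frame dat2 is hypothetical and exists only to illustrate a group with two different Type values.

```r
library(dplyr)

# Hypothetical data: Type takes two different values within Code 3
dat2 <- data.frame(
  Family = c("A", "A", "A"),
  Code   = c(1, 3, 3),
  Length = c(11, 8, 9),
  Type   = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)

res2 <- dat2 %>%
  group_by(Code) %>%
  summarise(
    Family = toString(unique(Family)),  # distinct values joined by ", "
    Length = mean(Length),
    Type   = toString(unique(Type))
  )

print(res2)
```

Other rules, such as keeping the first value in each group, may suit the data better. The choice depends on what the string columns mean.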