I'm just starting with ddply and finding it very useful. I want to summarize a data frame and also get rid of some rows in the final output based on whether the summarized column has a particular value. This is like HAVING as well as GROUP BY in SQL. Here's an example:
library(plyr)

input = data.frame(id      = c( 1,  1,  2,  2,  3,    3),
                   metric  = c(30, 50, 70, 90, 40, 1050),
                   badness = c( 1,  5,  7,  3,  3,   99))
# Summarize per id, then drop groups whose maxBadness is too high
intermediateoutput = ddply(input, ~ id, summarize,
                           meanMetric = mean(metric),
                           maxBadness = max(badness))
intermediateoutput[intermediateoutput$maxBadness < 50, 1:2]
This gives:
  id meanMetric
1  1         40
2  2         80
which is what I want, but can I do it in a single step within the ddply statement somehow?
You should try dplyr. It is faster, and the code is much easier to read and understand, especially if you use pipes (%>%):
library(dplyr)

input %>%
  group_by(id) %>%
  summarize(meanMetric = mean(metric), maxBadness = max(badness)) %>%
  filter(maxBadness < 50) %>%
  select(-maxBadness)
Following @Arun's comment, you can simplify the code this way:
input %>%
  group_by(id) %>%
  filter(max(badness) < 50) %>%
  summarize(meanMetric = mean(metric))
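
To see why the shorter version works, it may help to split it into two steps. On a grouped data frame, filter() evaluates max(badness) once per group, so each group is kept or dropped as a whole before anything is summarized. The following is only a minimal sketch, assuming dplyr is installed and using the example data from the question; the names `kept` and `result` are mine, chosen for illustration.

```r
library(dplyr)

# Example data copied from the question
input <- data.frame(id      = c( 1,  1,  2,  2,  3,    3),
                    metric  = c(30, 50, 70, 90, 40, 1050),
                    badness = c( 1,  5,  7,  3,  3,   99))

# Grouped filter: max(badness) is computed within each id,
# so all rows of a group are kept or dropped together.
# `kept` is an illustrative name, not part of the original answer.
kept <- input %>%
  group_by(id) %>%
  filter(max(badness) < 50)

# Summarize only the surviving groups; there is no helper column
# left over, so no select() step is needed afterwards.
result <- kept %>%
  summarize(meanMetric = mean(metric))

print(result)
```

Because the filtering happens before summarize(), the maxBadness column never has to be created and removed again, which is what makes this form the closer analogue of GROUP BY ... HAVING.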