I had a problem today figuring out a way to do an aggregation in dplyr
in R but for some reason was unable to come up with a solution (although I think this should be quite easy).
I have a data set like this:
structure(list(date = structure(c(16431, 16431, 16431, 16432,
16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green",
"red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L,
3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle"
), class = "factor"), value = c(100, 130, 100, 180, 125, 190,
120, 100, 140)), .Names = c("date", "colour", "shape", "value"
), row.names = c(NA, -9L), class = "data.frame")
which shows like this:
date colour shape value
1 2014-12-27 red square 100
2 2014-12-27 blue square 130
3 2014-12-27 blue triangle 100
4 2014-12-28 green triangle 180
5 2014-12-28 green triangle 125
6 2014-12-28 red square 190
7 2014-12-29 red circle 120
8 2014-12-29 blue circle 100
9 2014-12-29 blue circle 140
My goal is to calculate the most frequent colour, shape and the mean value per day. My expected output is the following:
date colour shape value
1 27/12/2014 blue square 110
2 28/12/2014 green triangle 165
3 29/12/2014 blue circle 120
I ended up doing it using split
and writing my own function to calculate the above for a data.frame
, then used snow::clusterApply
to run it in parallel. It was efficient enough (my original dataset is about 10M rows long) but I am wondering whether this can happen in one chain using dplyr
. Efficiency is really important for this so being able to run it in one chain is quite important.
You could do
dat %>% group_by(date) %>%
summarize(colour = names(which.max(table(colour))),
shape = names(which.max(table(shape))),
value = mean(value))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With