Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multiple aggregations (categorical and numeric) with dplyr in one chain

Tags:

r

dplyr

I had a problem today figuring out a way to do an aggregation in dplyr in R but for some reason was unable to come up with a solution (although I think this should be quite easy).

I have a data set like this:

structure(list(date = structure(c(16431, 16431, 16431, 16432, 
16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green", 
"red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L, 
3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle"
), class = "factor"), value = c(100, 130, 100, 180, 125, 190, 
120, 100, 140)), .Names = c("date", "colour", "shape", "value"
), row.names = c(NA, -9L), class = "data.frame")

which shows like this:

        date colour    shape value
1 2014-12-27    red   square   100
2 2014-12-27   blue   square   130
3 2014-12-27   blue triangle   100
4 2014-12-28  green triangle   180
5 2014-12-28  green triangle   125
6 2014-12-28    red   square   190
7 2014-12-29    red   circle   120
8 2014-12-29   blue   circle   100
9 2014-12-29   blue   circle   140

My goal is to calculate the most frequent colour, shape and the mean value per day. My expected output is the following:

        date colour    shape value
1 27/12/2014   blue   square   110
2 28/12/2014  green triangle   165
3 29/12/2014   blue   circle   120

I ended up doing it using split and writing my own function to calculate the above for a data.frame, then used snow::clusterApply to run it in parallel. It was efficient enough (my original dataset is about 10M rows long) but I am wondering whether this can happen in one chain using dplyr. Efficiency is really important for this so being able to run it in one chain is quite important.

like image 299
LyzandeR Avatar asked Apr 23 '15 23:04

LyzandeR


1 Answers

You could do

dat %>% group_by(date) %>%
    summarize(colour = names(which.max(table(colour))),
              shape = names(which.max(table(shape))),
              value = mean(value))
like image 88
David Robinson Avatar answered Sep 21 '22 10:09

David Robinson