Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate new column depending on aggregate function on group using dplyr (add summary statistics on the summary statistics)?

Quite often I need to calculate a new column for an R dataframe (in long form), whose value should depend on an aggregate function (e.g. sum) of a group. For instance, I might want to know what fraction of sales a product accounts for on any given day:

daily fraction = revenue for product i on day d / sum or revenue for all products on day d

My current strategy is to summarise and join:

library(dplyr)

join_summary <- function(data, ...) left_join(data, summarise(data, ...))

data = data.frame(
  day = c(1,1,2,2,3,3),
  product = rep(c("A", "B"), 3),
  revenue = c(2, 4, 8, 7, 9, 2)
)

data2 <- data %>%
  group_by(day) %>%
  join_summary(daily_revenue = sum(revenue)) %>%
  mutate(revenue_fraction = revenue / daily_revenue)

This works, but I am not sure if it is an anti-pattern. It seems a bit inefficient to repeat the same data (the daily revenue) on multiple rows, littering my data frame with aggregates. My questions are:

  • Is my current way of doing it OK?
  • Is there any better way of doing it, preferably using tools from dplyr or the wider Hadleyverse?
  • Do I really need my custom function join_summary, or can it be done with the existing dplyr verbs? (I prefer to stay within the "pipe flow", though.)
like image 738
Anders Avatar asked Feb 13 '17 21:02

Anders


People also ask

Which dplyr function is used to add new columns based on existing values?

You can use the mutate() function from the dplyr package to add one or more columns to a data frame in R.

How do I sum a column in dplyr?

Syntax: mutate(new-col-name = rowSums(.)) The rowSums() method is used to calculate the sum of each row and then append the value at the end of each row under the new column name specified.


1 Answers

Besides using summarise to aggregate, you can use mutate to assign summary stats to the full column:

data %>% 
  group_by(day) %>% 
  mutate(
    daily_revenue = sum(revenue), 
    revenue_fraction = revenue / daily_revenue
  )

which gives

Source: local data frame [6 x 5]
Groups: day [3]

    day product revenue daily_revenue revenue_fraction
  <dbl>  <fctr>   <dbl>         <dbl>            <dbl>
1     1       A       2             6        0.3333333
2     1       B       4             6        0.6666667
3     2       A       8            15        0.5333333
4     2       B       7            15        0.4666667
5     3       A       9            11        0.8181818
6     3       B       2            11        0.1818182

This works because the value sum(revenue) is recycled to fill in all rows in the group.

like image 165
Frank Avatar answered Sep 21 '22 12:09

Frank