How to calculate a new column depending on an aggregate function of a group using dplyr (add summary statistics on the summary statistics)?

Quite often I need to calculate a new column for an R dataframe (in long form), whose value should depend on an aggregate function (e.g. sum) of a group. For instance, I might want to know what fraction of sales a product accounts for on any given day:

daily fraction = revenue for product i on day d / sum of revenue for all products on day d

My current strategy is to summarise and join:

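library(dplyr)

# Helper: compute per-group summaries and join them back onto the original rows.
# The join key is inferred from the columns the two tables share (here: day).
join_summary <- function(data, ...) left_join(data, summarise(data, ...))

data <- data.frame(
  day = c(1, 1, 2, 2, 3, 3),
  product = rep(c("A", "B"), 3),
  revenue = c(2, 4, 8, 7, 9, 2)
)

data2 <- data %>%
  group_by(day) %>%
  join_summary(daily_revenue = sum(revenue)) %>%
  mutate(revenue_fraction = revenue / daily_revenue)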

This works, but I am not sure if it is an anti-pattern. It seems a bit inefficient to repeat the same data (the daily revenue) on multiple rows, littering my data frame with aggregates. My questions are:

- Is my current way of doing it OK?
- Is there any better way of doing it, preferably using tools from dplyr or the wider Hadleyverse?
- Do I really need my custom function join_summary, or can it be done with the existing dplyr verbs? (I prefer to stay within the "pipe flow", though.)

Besides using summarise to aggregate, you can use mutate to assign summary stats to the full column:

data %>% 
  group_by(day) %>% 
  mutate(
    daily_revenue = sum(revenue), 
    revenue_fraction = revenue / daily_revenue
  )

which gives

Source: local data frame [6 x 5]
Groups: day [3]

    day product revenue daily_revenue revenue_fraction
  <dbl>  <fctr>   <dbl>         <dbl>            <dbl>
1     1       A       2             6        0.3333333
2     1       B       4             6        0.6666667
3     2       A       8            15        0.5333333
4     2       B       7            15        0.4666667
5     3       A       9            11        0.8181818
6     3       B       2            11        0.1818182

This works because sum(revenue) returns a single value per group, and mutate recycles that value to fill in all rows in the group.
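If you would rather not keep the daily total around as a column at all, the sum can be used inline. The following is only a minimal sketch on the toy data from the question. The names fractions and joined are made up for illustration, and the explicit by = "day" is my addition to silence the join message. It also cross-checks the result against the summarise-and-join approach.

```r
library(dplyr)

# Same toy data as in the question
data <- data.frame(
  day = c(1, 1, 2, 2, 3, 3),
  product = rep(c("A", "B"), 3),
  revenue = c(2, 4, 8, 7, 9, 2)
)

# Compute the fraction directly; no daily_revenue column is stored
fractions <- data %>%
  group_by(day) %>%
  mutate(revenue_fraction = revenue / sum(revenue)) %>%
  ungroup()  # drop the grouping so later verbs act on the whole table

# Cross-check against the summarise-and-join approach from the question
joined <- data %>%
  left_join(
    data %>% group_by(day) %>% summarise(daily_revenue = sum(revenue)),
    by = "day"
  ) %>%
  mutate(revenue_fraction = revenue / daily_revenue)

# TRUE when both approaches give the same fractions
all.equal(fractions$revenue_fraction, joined$revenue_fraction)
```

Calling ungroup() at the end is optional, but it avoids surprises if later steps are not meant to run per day.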

Tags: join, r, aggregate, dplyr, summary

Anders

1 Answer

Frank
