
filtering within the summarise function of dplyr

Tags:

r

dplyr

I am struggling a little with dplyr because I want to do two things at once and wonder whether that is possible.

I want to calculate the mean of all values and, at the same time, the mean of only those values that have a specific value in another column.

library(dplyr)
set.seed(1234)
df <- data.frame(id=rep(1:10, each=14),
                 tp=letters[1:14],
                 value_type=sample(LETTERS[1:3], 140, replace=TRUE),
                 values=runif(140))

df %>%
  group_by(id, tp) %>%
  summarise(
    all_mean=mean(values),
    A_mean=mean(values), # Only the values with value_type A
    value_count=sum(value_type == 'A')
  )

So the A_mean column should contain the mean of values where value_type == 'A'.

I would normally run two separate commands and merge the results afterwards, but I suspect there is a more convenient way that I am just not seeing.
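For reference, the two-command-and-merge version could look roughly like the sketch below. It rebuilds the same df so that it runs on its own; the names all_stats, a_stats and merged are illustrative only, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Same example data as above
set.seed(1234)
df <- data.frame(id = rep(1:10, each = 14),
                 tp = letters[1:14],
                 value_type = sample(LETTERS[1:3], 140, replace = TRUE),
                 values = runif(140))

# Command 1: mean over all rows per group, plus the count of "A" rows
all_stats <- df %>%
  group_by(id, tp) %>%
  summarise(all_mean = mean(values),
            value_count = sum(value_type == "A"),
            .groups = "drop")

# Command 2: keep only the "A" rows, then take the mean per group
a_stats <- df %>%
  filter(value_type == "A") %>%
  group_by(id, tp) %>%
  summarise(A_mean = mean(values), .groups = "drop")

# Merge: groups without any "A" row get NA for A_mean
merged <- left_join(all_stats, a_stats, by = c("id", "tp"))
```

This works, but it scans the data twice and needs a join, which is what I would like to avoid.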

Thanks in advance.

asked Jun 29 '16 08:06 by drmariod



2 Answers

We can try subsetting values inside mean():

df %>%
  group_by(id, tp) %>%
  summarise(all_mean = mean(values),
            A_mean = mean(values[value_type == "A"]),
            value_count = sum(value_type == "A"))
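
One thing to watch with subsetting inside the aggregate: a group with no "A" rows ends up calling mean() on an empty vector, which gives NaN. A minimal sketch on a small hypothetical data frame (toy and the A_mean_na column are not from the question) that shows this, along with one possible way to return NA instead:

```r
library(dplyr)

# Hypothetical toy data (not from the question): id 2 has no "A" rows
toy <- data.frame(
  id         = c(1, 1, 1, 2, 2),
  value_type = c("A", "A", "B", "B", "C"),
  values     = c(1, 3, 5, 2, 4)
)

res <- toy %>%
  group_by(id) %>%
  summarise(
    all_mean    = mean(values),
    # Subset the column inside the aggregate; NaN when the group has no "A"
    A_mean      = mean(values[value_type == "A"]),
    # Same mean, but NA_real_ instead of NaN for groups without "A"
    A_mean_na   = if (any(value_type == "A")) {
      mean(values[value_type == "A"])
    } else {
      NA_real_
    },
    value_count = sum(value_type == "A")
  )
```

Whether NaN or NA is preferable depends on how the result is used downstream; is.na() is TRUE for both.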
answered Oct 18 '22 21:10 by akrun


You can do this with two summary steps:

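df %>%
  group_by(id, tp, value_type) %>%
  summarise(type_mean = mean(values), n = n()) %>%
  summarise(all_mean = weighted.mean(type_mean, n),
            A_mean = sum(type_mean * (value_type == "A")),
            value_count = sum(n * (value_type == "A")))

The first summary calculates the mean and the row count per value_type, and drops value_type from the grouping. The second weights those means by the counts to recover the overall mean, and "sums" only the mean of value_type == "A". Note that A_mean is 0, not NaN, for a group with no "A" rows.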
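
A quick check of how a two-step summary behaves when a group holds several rows, sketched on hypothetical toy data (toy, two_step and one_step are illustrative names, not from the question). It compares a plain mean of the per-type means with a count-weighted one; the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: id 1 has two "A" rows and one "B" row
toy <- data.frame(
  id         = c(1, 1, 1, 2, 2),
  value_type = c("A", "A", "B", "B", "C"),
  values     = c(1, 3, 5, 2, 4)
)

two_step <- toy %>%
  group_by(id, value_type) %>%
  # Step 1: mean and row count per value_type; id stays as the grouping
  summarise(type_mean = mean(values), n = n(), .groups = "drop_last") %>%
  # Step 2: combine the per-type results within each id
  summarise(
    unweighted  = mean(type_mean),              # mean of means
    all_mean    = weighted.mean(type_mean, n),  # weighted by row counts
    A_mean      = sum(type_mean * (value_type == "A")),
    value_count = sum(n * (value_type == "A"))
  )

# Direct single-step summary for comparison
one_step <- toy %>%
  group_by(id) %>%
  summarise(all_mean = mean(values),
            value_count = sum(value_type == "A"))
```

For id 1 the unweighted mean of means is 3.5, while the weighted version matches the direct mean of 3. In the question's data every (id, tp) group has exactly one row, so the difference would not show up there.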

answered Oct 18 '22 21:10 by AlexR