
Group by ID and filter only the group that has maximum mean

Tags: dataframe, r, dplyr

I have a DF as follows,

a <- data.frame(group =c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5), count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L,114L, 134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L))

   group count
1      1    12
2      1    80
3      1   102
4      1    97
5      2   118
6      2   115
7      2     4
8      2    13
9      3   136
10     3   114
11     3   134
12     3   126
13     4   128
14     4    63
15     4   118
16     4     1
17     5    28
18     5    18
19     5    18
20     5    23

I used the following command,

library(dplyr)

a %>% group_by(group) %>% summarise(mean(count))

  group mean(count)
  (dbl)       (dbl)
1     1       72.75
2     2       62.50
3     3      127.50
4     4       77.50
5     5       21.75

I want to keep only the rows of the group that has the highest mean. Here the third group has the maximum mean, so my output should be:

   group count
1     3   136
2     3   114
3     3   134
4     3   126

Can anybody give some idea how to do this?

haimen asked Dec 08 '22 22:12


2 Answers

In case you want to see a base R solution, you can do this using which.max and aggregate:

# calculate means by group
myMeans <- aggregate(count~group, a, FUN=mean)

# select the group with the max mean
maxMeanGroup <- a[a$group == myMeans[which.max(myMeans$count),]$group, ]
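A related base R variant, offered only as a sketch and not as part of the original answer: ave() returns each row's group mean, so the rows can be filtered directly without a separate table of means. The names groupMean and maxMeanRows are made up for this illustration, and the data is the question's data frame rebuilt with rep(). Unlike which.max, comparing against max() keeps every group that ties for the highest mean.

```r
# Rebuild the example data from the question
a <- data.frame(group = rep(1:5, each = 4),
                count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L, 114L,
                          134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L))

# ave() returns the group mean repeated on every row of that group
groupMean <- ave(a$count, a$group, FUN = mean)

# keep the rows whose group mean equals the largest group mean
# (all tied groups are kept, not just the first one)
maxMeanRows <- a[groupMean == max(groupMean), ]
maxMeanRows
```

With the question's data this should select the four rows of group 3.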

As a second method, you might try data.table:

library(data.table)
setDT(a)

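# which.max(count) is a row position in the table of means, so use it to
# look up the group label instead of comparing group against the position
a[group == a[, list(count = mean(count)), by = group
             ][which.max(count), group], ]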

which returns

   group count
1:     3   136
2:     3   114
3:     3   134
4:     3   126
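One point worth checking with this approach: which.max returns a row position in the table of means, and that position only coincides with the group label when the groups happen to be numbered 1, 2, 3, ... in order, as in the question. The sketch below uses a small hypothetical table b (not from the question) with labels 10, 20, 30 to show the difference. The names b, means, pos and top are assumptions made for the illustration.

```r
library(data.table)

# hypothetical data whose group labels are not 1, 2, 3, ...
b <- data.table(group = c(10, 10, 20, 20, 30, 30),
                count = c(1L, 3L, 50L, 70L, 5L, 7L))

# one row per group with its mean count
means <- b[, list(count = mean(count)), by = group]

pos <- means[, which.max(count)]       # row position of the largest mean
top <- means[which.max(count), group]  # group label at that position

byPosition <- b[group == pos]  # compares labels against a position
byLabel    <- b[group == top]  # compares labels against a label
```

Here byPosition is empty because no group is labelled 2, while byLabel holds the two rows of group 20.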
lmo answered Apr 02 '23 18:04


You'll want to mutate instead of summarize so you can keep all observations in your data.frame.

new_data <- a %>% group_by(group) %>% 
  ##compute average count within groups
  mutate(AvgCt = mean(count)) %>% 
  ungroup() %>% 
  ##filter, looking for the maximum of the created variable
  filter(AvgCt == max(AvgCt))

Then you have the final output

> new_data
Source: local data frame [4 x 3]

  group count AvgCt
  (dbl) (int) (dbl)
1     3   136 127.5
2     3   114 127.5
3     3   134 127.5
4     3   126 127.5

And, if you prefer to remove the computed variable,

new_data <- new_data %>% select(-AvgCt)

> new_data
Source: local data frame [4 x 2]

  group count
  (dbl) (int)
1     3   136
2     3   114
3     3   134
4     3   126
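If you would rather keep the summarise step from the question, another possible route (a sketch, not the only way) is to compute the group means, keep the group or groups with the largest mean, and then use semi_join to pull the matching rows from the original data. The names top and result and the column avg are made up for this example. No helper column ends up in the output, so there is nothing to remove afterwards.

```r
library(dplyr)

# the data frame from the question
a <- data.frame(group = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
                count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L, 114L,
                          134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L))

# one row per group, then keep the group(s) with the largest mean
top <- a %>%
  group_by(group) %>%
  summarise(avg = mean(count)) %>%
  filter(avg == max(avg))

# keep only the rows of a whose group appears in top
result <- a %>% semi_join(top, by = "group")
result
```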
BarkleyBG answered Apr 02 '23 19:04