Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr: max value in a group, excluding the value in each row?

I have a data frame that looks as follows:

> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]

  g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4

I know how to add a column with the maximum x value for each group g:

> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g

  g x x_max
1 A 7     7
2 A 3     7
3 B 5     9
4 B 9     9
5 B 2     9
6 C 4     4

But what I would like is to get is the maximum x value for each group g, excluding the x value in each row.

For the given example, the desired output would look like this:

Source: local data frame [6 x 3]
Groups: g

  g x x_max x_max_exclude
1 A 7     7             3
2 A 3     7             7
3 B 5     9             9
4 B 9     9             5
5 B 2     9             9
6 C 4     4            NA

I thought I might be able to use row_number() to remove particular elements and take the max of what remained, but hit warning messages and got incorrect -Inf output:

> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g

  g x x_max r x_max_exclude
1 A 7     7 1          -Inf
2 A 3     7 2          -Inf
3 B 5     9 1          -Inf
4 B 9     9 2          -Inf
5 B 2     9 3          -Inf
6 C 4     4 1          -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf

What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for the help.

like image 628
Eric Avatar asked Jun 11 '15 01:06

Eric


People also ask

What is dplyr in R?

These are the selected row with the maximum value in each group. dplyr is a n R package which is most commonly used to manipulate the data frame. dplyr provides various verbs (functions) for data manipulation such as filter, arrange, select, rename, mutate etc.

How do I find the maximum value of each group in R?

Often you may want to find the maximum value of each group in a data frame in R. Fortunately this is easy to do using functions from the dplyr package. The following code shows how to find the max value by team and position: The following code shows how to return the rows that contain the max value by team and position:

How do I find the maximum and lowest values of each group?

Now, we can use the group_by and the top_n functions to find the highest and lowest numeric values of each group: The RStudio console is showing the result of the previous R syntax: The maxima of A, B, and C are 2, 5, and 10, respectively. UPDATE: Note that top_n has been superseded in favor of slice_min ()/slice_max ().

How to select the row with max value in each group?

Step 1: Load the dataset into a variable (group). Step 2: Sorted the marks in descending order for each group (A, B, C). As our sub is now in ascending order, and we are ready to select the row with max value in each group, here groups are A, B, C. Step 3: Remove the duplicate rows from the sorted subject column.


2 Answers

You could try:

df %>% 
  group_by(g) %>% 
  arrange(desc(x)) %>% 
  mutate(max = ifelse(x == max(x), x[2], max(x)))

Which gives:

#Source: local data frame [6 x 3]
#Groups: g
#
#  g x max
#1 A 7   3
#2 A 3   7
#3 B 9   5
#4 B 5   9
#5 B 2   9
#6 C 4  NA

Benchmark

I've tried the solutions so far on the benchmark:

df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))

library(microbenchmark)

mbm <- microbenchmark(
  steven = df %>% 
    group_by(g) %>% 
    arrange(desc(x)) %>% 
    mutate(max = ifelse(x == max(x), x[2], max(x))),
  eric = df %>% 
    group_by(g) %>% 
    mutate(x_max = max(x), 
           x_max2 = sort(x, decreasing = TRUE)[2], 
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
    select(-x_max2),
  arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
  times = 50
)

@Arun's data.table solution is the fastest:

# Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval cld
#  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50  b 
#    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
#    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50 a   

enter image description here

like image 174
Steven Beaupré Avatar answered Oct 13 '22 23:10

Steven Beaupré


Interesting problem. Here's one way using data.table:

require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]

The idea is to order by column x and on those indices, we group by g. Since we've the ordered indices, for the first .N-1 rows, the max value is the value at .N. And for the .Nth row, it's the value at .N-1th row.

.N is a special variable that holds the number of observations in each group.

I'll leave it to you and/or the dplyr experts to translate this (or answer with another approach).

like image 44
Arun Avatar answered Oct 14 '22 01:10

Arun