
dplyr: max value in a group, excluding the value in each row?

Tags: r, greatest-n-per-group, dplyr

I have a data frame that looks as follows:

> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]

  g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4

I know how to add a column with the maximum x value for each group g:

> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g

  g x x_max
1 A 7     7
2 A 3     7
3 B 5     9
4 B 9     9
5 B 2     9
6 C 4     4

But what I would like to get is the maximum x value for each group g, excluding the x value in each row.

For the given example, the desired output would look like this:

Source: local data frame [6 x 3]
Groups: g

  g x x_max x_max_exclude
1 A 7     7             3
2 A 3     7             7
3 B 5     9             9
4 B 9     9             5
5 B 2     9             9
6 C 4     4            NA

I thought I might be able to use row_number() to remove particular elements and take the max of what remained, but hit warning messages and got incorrect -Inf output:

> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g

  g x x_max r x_max_exclude
1 A 7     7 1          -Inf
2 A 3     7 2          -Inf
3 B 5     9 1          -Inf
4 B 9     9 2          -Inf
5 B 2     9 3          -Inf
6 C 4     4 1          -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf

What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for the help.


asked Jun 11 '15 01:06

Eric

2 Answers

You could try:

df %>% 
  group_by(g) %>% 
  arrange(desc(x)) %>% 
  mutate(max = ifelse(x == max(x), x[2], max(x)))

Which gives:

#Source: local data frame [6 x 3]
#Groups: g
#
#  g x max
#1 A 7   3
#2 A 3   7
#3 B 9   5
#4 B 5   9
#5 B 2   9
#6 C 4  NA
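
As for why the row_number() attempt in the question returns -Inf: inside a grouped mutate(), r holds all row numbers of the group at once, so x[-r] removes every element and max() is called on an empty vector. The base R sketch below reproduces this for group B of the example and shows a per-row alternative. The name max_excluding_each is only illustrative, and the NA for single-row groups is an assumption taken from the desired output in the question.

```r
# Values of x for group B in the question's example.
x <- c(5, 9, 2)
r <- seq_along(x)  # what row_number() yields within the group: 1 2 3

# Negative indexing with the whole vector r drops every element,
# which is why max() then warns and returns -Inf.
length(x[-r])

# Dropping one position at a time gives the "max of the others" per row.
# Single-row groups get NA (assumption: matches the desired output for C).
max_excluding_each <- vapply(
  seq_along(x),
  function(i) if (length(x) > 1) max(x[-i]) else NA_real_,
  numeric(1)
)
max_excluding_each
```

This per-row loop is easy to read but does quadratic work within each group, so the sorted approach above is likely the better choice for large groups.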

Benchmark

I benchmarked the solutions posted so far:

df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))

library(microbenchmark)

mbm <- microbenchmark(
  steven = df %>% 
    group_by(g) %>% 
    arrange(desc(x)) %>% 
    mutate(max = ifelse(x == max(x), x[2], max(x))),
  eric = df %>% 
    group_by(g) %>% 
    mutate(x_max = max(x), 
           x_max2 = sort(x, decreasing = TRUE)[2], 
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
    select(-x_max2),
  arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
  times = 50
)

@Arun's data.table solution is the fastest:

# Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval cld
#  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50  b 
#    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
#    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50 a



answered Oct 13 '22 23:10

Steven Beaupré

Interesting problem. Here's one way using data.table:

require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]

The idea is to order the rows by column x and then, on that ordering, group by g. Because each group is now sorted in ascending order, the maximum for the first .N-1 rows is the value at row .N, and for row .N it is the value at row .N-1.

.N is a special variable that holds the number of observations in each group.

I'll leave it to you and/or the dplyr experts to translate this (or answer with another approach).
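
One possible dplyr translation of the same ordering idea is sketched below. It is not from the original answer: the explicit NA for single-row groups is an assumption based on the desired output in the question, and the result comes back sorted by x rather than in the original row order.

```r
library(dplyr)

df <- data.frame(g = c("A", "A", "B", "B", "B", "C"),
                 x = c(7, 3, 5, 9, 2, 4),
                 stringsAsFactors = FALSE)

res <- df %>%
  arrange(x) %>%    # ascending, so each group's max is its last row
  group_by(g) %>%
  mutate(
    x_max_exclude = if (n() > 1) {
      # all rows but the last see the group max;
      # the last row (the max itself) sees the second-largest value
      c(rep(last(x), n() - 1L), x[n() - 1L])
    } else {
      NA_real_      # assumption: single-row groups get NA
    }
  ) %>%
  ungroup()

res
```

Ties at the maximum are handled naturally: if two rows share the largest value, the second-to-last value equals the maximum, so each of them still sees the other.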

answered Oct 14 '22 01:10

Arun
