I have a data frame that looks as follows:
> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]
g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4
I know how to add a column with the maximum x
value for each group g
:
> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max
1 A 7 7
2 A 3 7
3 B 5 9
4 B 9 9
5 B 2 9
6 C 4 4
But what I would like is to get is the maximum x
value for each group g
, excluding the x
value in each row.
For the given example, the desired output would look like this:
Source: local data frame [6 x 3]
Groups: g
g x x_max x_max_exclude
1 A 7 7 3
2 A 3 7 7
3 B 5 9 9
4 B 9 9 5
5 B 2 9 9
6 C 4 4 NA
I thought I might be able to use row_number()
to remove particular elements and take the max of what remained, but hit warning messages and got incorrect -Inf
output:
> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g
g x x_max r x_max_exclude
1 A 7 7 1 -Inf
2 A 3 7 2 -Inf
3 B 5 9 1 -Inf
4 B 9 9 2 -Inf
5 B 2 9 3 -Inf
6 C 4 4 1 -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number()
doesn't work would also be much appreciated. Thanks for the help.
These are the selected row with the maximum value in each group. dplyr is a n R package which is most commonly used to manipulate the data frame. dplyr provides various verbs (functions) for data manipulation such as filter, arrange, select, rename, mutate etc.
Often you may want to find the maximum value of each group in a data frame in R. Fortunately this is easy to do using functions from the dplyr package. The following code shows how to find the max value by team and position: The following code shows how to return the rows that contain the max value by team and position:
Now, we can use the group_by and the top_n functions to find the highest and lowest numeric values of each group: The RStudio console is showing the result of the previous R syntax: The maxima of A, B, and C are 2, 5, and 10, respectively. UPDATE: Note that top_n has been superseded in favor of slice_min ()/slice_max ().
Step 1: Load the dataset into a variable (group). Step 2: Sorted the marks in descending order for each group (A, B, C). As our sub is now in ascending order, and we are ready to select the row with max value in each group, here groups are A, B, C. Step 3: Remove the duplicate rows from the sorted subject column.
You could try:
df %>%
group_by(g) %>%
arrange(desc(x)) %>%
mutate(max = ifelse(x == max(x), x[2], max(x)))
Which gives:
#Source: local data frame [6 x 3]
#Groups: g
#
# g x max
#1 A 7 3
#2 A 3 7
#3 B 9 5
#4 B 5 9
#5 B 2 9
#6 C 4 NA
Benchmark
I've tried the solutions so far on the benchmark:
df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
x = sample(1:10, 10e5, replace = TRUE))
library(microbenchmark)
mbm <- microbenchmark(
steven = df %>%
group_by(g) %>%
arrange(desc(x)) %>%
mutate(max = ifelse(x == max(x), x[2], max(x))),
eric = df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2),
arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
times = 50
)
@Arun's data.table solution is the fastest:
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448 50 b
# eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170 50 c
# arun 44.48639 46.17961 54.65824 47.74142 48.9884 102.3830 50 a
Interesting problem. Here's one way using data.table
:
require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
The idea is to order by column x
and on those indices, we group by g
. Since we've the ordered indices, for the first .N-1
rows, the max value is the value at .N
. And for the .N
th row, it's the value at .N-1
th row.
.N
is a special variable that holds the number of observations in each group.
I'll leave it to you and/or the dplyr
experts to translate this (or answer with another approach).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With