Getting the top values by group

Here's a sample data frame:

    d <- data.frame(
      x   = runif(90),
      grp = gl(3, 30)
    )

I want the subset of d containing the rows with the top 5 values of x for each value of grp.

Using base-R, my approach would be something like:

    ordered <- d[order(d$x, decreasing = TRUE), ]
    splits <- split(ordered, ordered$grp)
    heads <- lapply(splits, head, n = 5)  # head() defaults to n = 6, so set n explicitly
    do.call(rbind, heads)
    ##              x grp
    ## 1.19 0.8879631   1
    ## 1.4  0.8844818   1
    ## 1.12 0.8596197   1
    ## 1.26 0.8481809   1
    ## 1.18 0.8461516   1
    ## 2.31 0.9751049   2
    ## 2.34 0.9269764   2
    ## 2.57 0.8964114   2
    ## 2.58 0.8896466   2
    ## 2.45 0.8888834   2
    ## 3.74 0.9884852   3
    ## 3.73 0.9837653   3
    ## 3.83 0.9375398   3
    ## 3.64 0.9229036   3
    ## 3.69 0.8021373   3

Using dplyr, I expected this to work:

    d %>%
      arrange_(~ desc(x)) %>%
      group_by_(~ grp) %>%
      head(n = 5)

but it only returns the overall top 5 rows.

Swapping head for top_n returns the whole of d.

    d %>%
      arrange_(~ desc(x)) %>%
      group_by_(~ grp) %>%
      top_n(n = 5)

How do I get the correct subset?

Richie Cotton asked Jan 04 '15 13:01





1 Answer

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

    d %>%
      group_by(grp) %>%
      slice_max(order_by = x, n = 5)
    # # A tibble: 15 x 2
    # # Groups:   grp [3]
    #        x grp
    #    <dbl> <fct>
    #  1 0.994 1
    #  2 0.957 1
    #  3 0.955 1
    #  4 0.940 1
    #  5 0.900 1
    #  6 0.963 2
    #  7 0.902 2
    #  8 0.895 2
    #  9 0.858 2
    # 10 0.799 2
    # 11 0.985 3
    # 12 0.893 3
    # 13 0.886 3
    # 14 0.815 3
    # 15 0.812 3
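One detail worth knowing: by default slice_max() keeps ties, so a group can return more than n rows when several values are tied at the cutoff. The sketch below assumes dplyr 1.0.0 or later is installed and reuses the simulated data from this answer; with_ties = FALSE is passed to get exactly five rows per group, and the name top5 is only illustrative.

```r
library(dplyr)

set.seed(123)
d <- data.frame(
  x   = runif(90),
  grp = gl(3, 30)
)

# Keep exactly five rows per group, dropping any extra rows tied at the cutoff
top5 <- d %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 5, with_ties = FALSE) %>%
  ungroup()

# Number of rows kept per group
count(top5, grp)
```

With continuous data such as runif() output, ties are very unlikely, so the default and with_ties = FALSE should give the same rows here; the difference matters mostly for integer or rounded values.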

Pre-dplyr 1.0.0 using top_n:

From ?top_n, about the wt argument:

"The variable to use for ordering [...] defaults to the last variable in the tbl".

The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.

    d %>%
      group_by(grp) %>%
      top_n(n = 5, wt = x)
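To see why the default returned every row, it may help to look at the ranks directly. The dplyr documentation describes top_n() as a wrapper around filter() and min_rank(); the sketch below mimics that with mutate() rather than calling top_n() itself, so treat it as an illustration of the idea, not the exact internals. The column names rank_by_grp and rank_by_x are made up for this example.

```r
library(dplyr)

set.seed(123)
d <- data.frame(
  x   = runif(90),
  grp = gl(3, 30)
)

# Rank rows within each group, once by grp (the default wt) and once by x
ranks <- d %>%
  group_by(grp) %>%
  mutate(
    rank_by_grp = min_rank(desc(grp)),  # grp is constant within a group, so all rows tie
    rank_by_x   = min_rank(desc(x))     # distinct ranks as long as x has no ties
  ) %>%
  ungroup()

# How many rows pass a "rank <= 5" filter under each ranking
sum(ranks$rank_by_grp <= 5)
sum(ranks$rank_by_x <= 5)
```

Because every row within a group ties for rank 1 when ranking by grp, the filter keeps everything, whereas ranking by x keeps only the five largest values in each group.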

Data:

    set.seed(123)
    d <- data.frame(
      x = runif(90),
      grp = gl(3, 30))
Henrik answered Sep 28 '22 20:09
